Swarm Intelligence Algorithms for Feature Selection: A Review



Introduction
Nowadays, researchers in machine learning and similar domains are increasingly recognizing the importance of dimensionality reduction of analyzed data. Not only do such high-dimensional data affect learning models by increasing the search space and computational time, but they can also be considered information poor [1,2]. Furthermore, because of high-dimensional data and, consequently, a vast number of features, the construction of a suitable machine learning model can be extremely demanding and often almost fruitless. We are faced with the so-called curse of dimensionality, a well-known phenomenon arising when analyzing data in high-dimensional spaces: data in high-dimensional space become sparse [2,3].
To overcome problems arising from the high dimensionality of data, researchers mainly use two approaches. The first one is feature extraction, which comprises the creation of a new feature space with low dimensionality. The second one is feature selection, which focuses primarily on the removal of irrelevant and redundant features from the original feature set and, therefore, on the selection of a small subset of relevant features. The feature selection problem with n features has 2^n possible solutions (or feature subsets). With each additional feature, the complexity doubles. Thus, a dataset with 1000 features per instance has 2^1000 possible feature subsets.
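The exponential growth of this search space is easy to demonstrate; for very small n, the feature subsets can even be enumerated directly (a minimal sketch in Python, with the hypothetical feature names "f1"-"f3" chosen purely for illustration):

```python
from itertools import chain, combinations

def all_feature_subsets(features):
    """Enumerate every subset of a feature list (there are 2^n of them)."""
    return list(chain.from_iterable(
        combinations(features, k) for k in range(len(features) + 1)))

subsets = all_feature_subsets(["f1", "f2", "f3"])
print(len(subsets))           # 2^3 = 8 subsets
# A 1000-feature dataset is far beyond exhaustive enumeration:
print(2 ** 1000 > 10 ** 300)  # True
```

This is exactly why heuristic search, such as SI, is attractive: exhaustive evaluation of all 2^n subsets is infeasible beyond a few dozen features.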
Feature Selection (FS) is increasingly tackled with Swarm Intelligence (SI) algorithms [4,5], because SI has proven to be a technique that can solve NP-hard computational problems [6], and finding an optimal feature subset is certainly such a problem. SI algorithms themselves have gained popularity in recent years [4], and, nowadays, the field has by far outgrown its former division into only Particle Swarm Optimization [7] and Ant Colony Optimization [8], the two best-known SI approaches. In recent years, a considerable number of new swarm-inspired algorithms have emerged, as can be seen in the scope of surveys (e.g., [9,10]).
Despite the proven usage of SI algorithms for FS [11], a narrowed search for surveys in this field yields few results. Basir and Ahmad [12] conducted a comparison of swarm algorithms for feature selection/reduction. They focused on only six bio-inspired swarm algorithms, i.e., Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), the Artificial Fish Swarm Algorithm (AFSA), the Artificial Bee Colony algorithm (ABC), the Firefly Algorithm (FA), and the Bat Algorithm (BA).
In [13], Xue et al. provided a comprehensive survey of Evolutionary Computation (EC) techniques for FS. They divided EC paradigms into Evolutionary Algorithms (EAs), SI, and others. Among SI techniques, only PSO and ACO were analyzed in detail. The only other SI algorithms for FS mentioned briefly are the Bees algorithm and ABC.
However, even a quick literature review reveals that FS can also be addressed successfully by a number of other SI algorithms and approaches. Obviously, there is a gap in the coverage of SI algorithms for FS. With this paper, we try to fill this gap by providing a comprehensive overview of how SI can be used for FS and of the most common problems and domains where SI algorithms have proven especially successful.
In this paper, we try to find answers to three major research questions:

RQ1: What is the current state of this research area?
1.1: Is this research area interesting for a wide audience?
1.2: How many papers that utilize SI approaches for FS tasks have appeared in recent years?
1.3: Are there any special real-world applications combining swarm intelligence and feature selection?

RQ2: What are the main differences among SI approaches for FS?
2.1: Are there any special differences in architecture design, additional parameters, computational loads, and operators?

RQ3: How could a researcher develop an SI approach for FS tasks?
3.1: What are the main steps, and how can someone easily tailor a problem to existing frameworks?
3.2: How can researchers in this domain describe experimental settings and conduct fair comparisons among various algorithms?
The main contributions of this paper are, thus, as follows:

•
We performed a comprehensive literature review of swarm intelligence algorithms and propose an SI taxonomy based on the algorithms' biological inspiration. For each reviewed algorithm, we provide the year of its first appearance in the scientific literature; the chronological perspective of SI evolution is presented in this manner.

•
We provide a unified SI framework, which consists of five fundamental phases. Although the algorithms use different mechanisms, these five phases are common to all of them. The SI framework is used to explain different approaches to FS.

•
We provide a comprehensive overview of SI algorithms for FS. Using the proposed SI framework, different approaches, techniques, methods, their settings, and implications for use are explained for various FS aspects. Furthermore, the most frequent datasets used for the evaluation of SI algorithms for FS are presented, as well as the most common application areas. Differences between EA approaches and SI approaches are also outlined.

•
We systematically outline guidelines for researchers who would like to use SI approaches for real-world FS tasks.
This paper firstly provides an SI definition, the SI framework, and the proposed SI taxonomy in Section 2. FS and its variants are presented in Section 3. Section 4 comprises the methodology for searching the relevant literature for this survey, and in Section 5, SI is applied to the problem of FS. Based on the proposed SI framework, each stage of the latter provides examples of proposed SI algorithms for FS. Applications that combine SI and FS are summarized in Section 6, and further divided based on the used datasets or application areas. Section 7 brings the discussion of SI for FS algorithms, where issues, open questions, and guidelines for SI researchers are debated in detail. Finally, Section 8 contains the conclusions and plans for further work in the field.

Swarm Intelligence
The term Swarm Intelligence was introduced by Beni and Wang [14] in the context of cellular robotic systems. Nowadays, SI is commonly accepted as one branch of Computational Intelligence (CI) [15].
When speaking of SI, the five principles of the SI paradigm must be fulfilled. These principles were defined by Milonas [16] as early as 1994, and comprise the proximity principle, the quality principle, the principle of diverse response, the principle of stability, and the principle of adaptability.
Steady growth of the scientific papers on the topic of SI shows that SI is one of the most promising forms of CI. Contributing to this are undoubtedly the advantages of SI [17]. One of the clearest benefits is autonomy. The swarm does not have any outside management; instead, each agent in the swarm controls its behavior autonomously. An agent in this case represents a possible solution to a given problem. From this, we can deduce the second advantage, which is self-organization. The intelligence is not concentrated in the individual agent, but emerges in the swarm itself. Therefore, the solutions (agents) of the problem are not known in advance, but change while the program is running. Self-organization plays an important role in adaptability. The latter is noticeable in changing environments, where agents respond well to the mentioned changes, modify their behavior, and adapt to them autonomously. In addition, because there is no central coordination, the swarm is robust, as there is no single point of failure. Moreover, the swarm enables redundancy, in which two other advantages hide. The first one is scalability, meaning that the swarm can consist of anywhere from a few up to thousands of agents and, either way, the control architecture remains the same. Furthermore, because no single agent is essential for the swarm, the SI advantage of flexibility is fully satisfied.
Each SI algorithm must obey some fundamental phases. Thus, the SI framework (Figure 1) can be defined as follows:
1. Initialize population
2. Define stop condition
3. Evaluate fitness function
4. Update and move agents
5. Return the global best solution

Before the initialization phase, the values of the algorithm parameters should be defined. For example, in PSO, the initial population of feasible solutions is generated and the values of the following required parameters are set: cognitive and social coefficients c1 and c2, inertia weight ω, and lower and upper bounds of velocity v_min and v_max, respectively. The evolutionary process then begins with initialization; some initialization strategies are presented in Section 5.2.

The primary function of the stop condition is to stop the execution of the algorithm. As shown in Section 5.3, the stopping criterion is either single or composed.
The third phase of the SI framework, which evaluates the fitness function, is responsible for the evaluation of the search agents. Similar to the stop condition, the fitness function is either single, i.e., some basic metric such as classification accuracy, or composed. The fitness functions used in FS are presented in Section 5.4.
Agents in an SI algorithm update and move based on the mathematical background of the algorithm. The moves and discretized positions of agents for FS are presented in Section 5.5.
The result of an SI algorithm is the best search agent.
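The five phases above can be sketched as one generic loop. The sketch below uses a PSO-like velocity update as the movement rule; the parameter values (w, c1, c2, population size, iteration budget) are illustrative assumptions, not prescribed settings:

```python
import random

def si_optimize(fitness, dim, n_agents=20, max_iter=100, w=0.7, c1=1.5, c2=1.5):
    """Generic SI framework: initialize, stop condition, evaluate,
    update/move agents, return the global best solution."""
    # Phase 1: initialize population (random positions in [0, 1]^dim)
    pos = [[random.random() for _ in range(dim)] for _ in range(n_agents)]
    vel = [[0.0] * dim for _ in range(n_agents)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_agents), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(max_iter):              # Phase 2: stop condition (budget)
        for i in range(n_agents):
            for d in range(dim):           # Phase 4: update and move agents
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i])            # Phase 3: evaluate fitness
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit                # Phase 5: return the global best

random.seed(42)
best, fit = si_optimize(lambda x: -sum((v - 0.5) ** 2 for v in x), dim=3)
```

Every algorithm reviewed below fills these five phases differently; only the movement rule in Phase 4 and the encoding of `pos` change.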
Based on the biological inspiration of SI algorithms, the taxonomy of SI is proposed (see Figure 2). The latter does not include other bio-inspired optimization algorithms [18] which do not have the swarm concept, e.g., monkey search [19], dolphin echolocation [20], and the jaguar algorithm [21].
Lately, researchers have also tackled some optimization problems with Agent Swarm Optimization (ASO). The benefits of such an approach are a multi-agent point of view, the coexistence of different agent breeds and their interaction, and the endowment of agents with problem-specific rules [22,23].

Feature Selection
In Machine Learning (ML), the belief that "the more the variables, the better the performance" has not been acceptable for some time now; thus, the application of variable (feature) selection techniques has been fast gaining popularity in the field of Data Mining [24]. Although there are different techniques and various approaches in ML (supervised classification or regression, unsupervised clustering, etc.), a dataset of instances, represented by the values of a number of attributes (or features), is common to them all. The main goal of ML algorithms is then to find the best possible mapping between the features' values and the desired result (a classification or a regression model, a set of clusters, etc.) by searching for patterns in data.
When the dimensionality of data increases, the computational cost also increases, usually exponentially. To overcome this problem, it is necessary to find a way to reduce the number of features in consideration. Generally, two techniques are often used: feature (subset) selection and feature extraction. Feature extraction creates new variables as combinations of others to reduce the dimensionality. FS, on the other hand, works by removing features that are not relevant or are redundant (Figure 3). In this manner, it searches for a projection of the data onto fewer variables (features) which preserves the information as much as possible and generally allows ML algorithms to find better mappings more efficiently. FS algorithms can be separated into three categories [25]: 1. filter models; 2. wrapper models; and 3. embedded models.
A filter model evaluates features without utilizing any ML algorithm, relying instead on the characteristics of the data [26]. First, it ranks features based on certain criteria, either independently of each other or by considering the feature space. Second, the features with the highest rankings are selected to compose a feature subset. Some of the most common filter methods are Information Gain, Mutual Information, Chi2, Fisher Score, and ReliefF.
The major disadvantage of the filter approach is that it ignores the effects of the selected features on the performance of the used ML algorithm [27]. A wrapper model builds on the assumption that the optimal feature set depends on why and how the features will be used in a certain situation and by a certain ML algorithm. Thus, by taking the model hypothesis into account through training and testing in the feature space, wrapper models tend to perform better in selecting features. This, however, leads to the big disadvantage of wrapper models: computational inefficiency.
Finally, an embedded model embeds FS within a specific ML algorithm. In this manner, embedded models have the advantages of both wrapper models, i.e., they include the interaction with the ML algorithm, and filter models, i.e., they are far less computationally intensive than wrapper methods [24]. The major disadvantage of embedded methods, however, is that an existing ML algorithm needs to be specifically redesigned, or even a new one developed, to embed the FS functionality alongside its usual ML elements. For this reason, the existing and commonly accepted ML algorithms cannot be used directly, which significantly increases the effort needed to perform the task.
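To make the filter idea concrete, a minimal filter model might rank features by their absolute Pearson correlation with the class label and keep the top k. The scoring criterion here is an assumption chosen for brevity; real filters typically use Information Gain, ReliefF, and similar measures:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def filter_select(X, y, k):
    """Filter model: rank features by |correlation with y| and keep the
    k best. Note that no ML algorithm is ever trained."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Feature 0 perfectly tracks the label; feature 1 is uncorrelated noise.
X = [[0, 5], [1, 3], [0, 4], [1, 6]]
y = [0, 1, 0, 1]
print(filter_select(X, y, k=1))  # [0]
```

The same ranking runs in a single pass over the data, which is exactly why filters are cheap but blind to how a specific classifier will use the selected features.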

Methodology
In this section, the strategies for searching the relevant literature for this survey are described in detail. Firstly, we selected the most appropriate keywords and built a search string. The selected keywords were "swarm intelligence" and "feature selection", while the search string was: "swarm intelligence" + "feature selection". The search was limited to the major databases that mostly cover computer-science-related content: ACM Digital Library, Google Scholar, IEEE Xplore, ScienceDirect, SpringerLink, and Web of Science. The results obtained from these databases are presented in Table 1. The next stage involved pre-screening the abstracts of the returned results. The main aim of pre-screening was to remove redundant results (some papers were returned by more than one database) along with inappropriate ones. Inappropriate results covered papers whose authors claimed they used the SI concept, but our study revealed that they involved only bio-inspired algorithms without the SI concept.
Altogether, we selected 64 papers that were studied in depth. The findings of this study are presented in the next sections.
Papers comprising SI and FS have shown almost exponential growth over the past 17 years. The growth is visible in Figure 4; in 2017 alone, 1170 papers were written on this topic. The biggest increase was also noted in 2017, when 206 more papers were written than in the year before. The statistics were obtained from Google Scholar in June 2018.

Swarm Intelligence for Feature Selection
SI algorithms are a natural choice for optimizing the feature subset selection process within a wrapper model approach. Wrapper models utilize a predefined ML algorithm (i.e., a classifier) to evaluate the quality of features, and the representational biases of the algorithm are avoided by the FS process. However, they have to run the ML algorithm many times to assess the quality of selected subsets of features, which is computationally very expensive. Thus, SI algorithms can be used to optimize the process of FS within a wrapper model, their aim being to determine which subset of all possible features provides the best predictive performance when used with a predefined ML algorithm. In this manner, for an FS algorithm, the actual types and values of features are not important; it works well for either homogeneous or heterogeneous data that are continuous, discrete and/or mixed, or even represented as images or videos [28-33]. As classification is the most common ML task, we use it to demonstrate the FS approach; any other ML technique can be used similarly.
Given a predefined classification algorithm, a typical wrapper method iterates over the following two steps until the desired goal is achieved (i.e., a certain accuracy is reached):

•
Select a subset of features.

•
Evaluate the selected subset by performing the selected classification algorithm.
A general framework for a wrapper model of FS for classification [27] is shown in Figure 5.
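The two-step loop above can be sketched as follows. The candidate generator here is plain random subset sampling and the "classifier" is a caller-supplied evaluation function; both are simplifying assumptions (in the algorithms surveyed below, an SI algorithm proposes the candidate subsets instead):

```python
import random

def wrapper_search(n_features, evaluate, n_rounds=50, target_acc=0.95):
    """Wrapper model: repeatedly (1) select a feature subset and
    (2) score it by running the ML algorithm, until the goal is met."""
    best_subset, best_acc = None, -1.0
    for _ in range(n_rounds):
        # Step 1: select a subset of features (random candidate here;
        # an SI algorithm would generate these candidates instead).
        subset = [j for j in range(n_features) if random.random() < 0.5]
        if not subset:
            continue
        # Step 2: evaluate the subset with the chosen classifier.
        acc = evaluate(subset)
        if acc > best_acc:
            best_subset, best_acc = subset, acc
        if best_acc >= target_acc:   # desired accuracy reached -> stop
            break
    return best_subset, best_acc

# Hypothetical evaluator: only feature 0 matters for this toy task.
random.seed(0)
subset, acc = wrapper_search(5, lambda s: 1.0 if 0 in s else 0.5)
```

Each call to `evaluate` stands for a full train/test cycle of the classifier, which is where the computational cost of wrapper models comes from.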
In the next subsections, the SI algorithms are analyzed based on the SI framework from Section 2. Only those derivations of the analyzed algorithms that comprise FS have been considered.

Agent Representation
The continuous and binary versions of SI algorithms have been used for FS. The main difference among SI approaches is in their representation of the search agent. Generally, in the binary versions of the algorithms, an agent is formed as a binary string, where a selected feature is represented with the number 1 and a non-selected feature with 0. In the continuous approaches, the search agent is represented with real values, and the selection of features is based on a threshold θ, 0 < θ < 1. A feature is selected only if its value is greater than the threshold.
Based on that, there are mainly three different encoding mechanisms. One of the most used is binary encoding [30,34], followed by continuous encoding [29] and the hybrid combination of the two [31].
A particular case is the ACO algorithm, in which the features are represented as graph nodes. When an ant moves and visits a node in the graph, the node becomes part of the feature subset. Therefore, each feature subset is represented by an ant.
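The two main encodings can be sketched directly; the threshold θ = 0.6 below is an arbitrary illustrative value, not one prescribed by the reviewed works:

```python
def decode_binary(agent):
    """Binary encoding: feature j is selected iff agent[j] == 1."""
    return [j for j, bit in enumerate(agent) if bit == 1]

def decode_continuous(agent, theta=0.6):
    """Continuous encoding: feature j is selected iff its real value
    exceeds the threshold theta, 0 < theta < 1."""
    return [j for j, value in enumerate(agent) if value > theta]

print(decode_binary([1, 0, 1, 1]))               # [0, 2, 3]
print(decode_continuous([0.9, 0.2, 0.61, 0.6]))  # [0, 2]
```

The decode step is the only place where the encoding matters; the fitness function always receives a concrete feature subset either way.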

Initialize Population
Initialization of a population is the primary step in any SI technique, as shown in Section 2. Based on the analyzed articles, six different strategies of population initialization are commonly used.
The first one is the initialization of the population with the full set of features (all features selected), also known as backward selection. The opposite of this approach is the initialization of the population with an empty set of features (no feature is selected), also known as forward selection. A randomly generated swarm population is the third strategy, in which the choice of selecting a feature is random. Based on the analyzed articles, this is also the most commonly chosen one.
The following strategies are derivatives of the above approaches. One of them is a strategy in which only a small set of features is selected, thereby mimicking forward selection [32,34]. Swarm initialization with a large number of selected features, mimicking backward selection [32,34], is the next one. Some approaches use continuous swarm intelligence techniques with floating-point values for the initial population of agents [29].
In the ACO algorithm, initialization is mainly done in such a way that the first node in the graph is selected randomly, and the following ones are selected based on the probabilistic transition rule [35]. There are also some variations of the graph representation in the way its nodes are connected. Although most algorithms use a fully connected graph [35,36], which means the nodes are all connected to each other, some researchers (e.g., [37]) connected each node with only two other nodes.
In [38], researchers initialized the food sources in ABC with only one selected feature. The authors in [34,39] used four initialization approaches, i.e., normal, small, big, and mixed.
Another approach was used in [40], where researchers applied Minimum Redundancy Maximum Relevance (MRMR) initialization, which combines relevance to the target class with redundancy to other features.
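Several of these strategies fit in a few lines of code. The selection probabilities for the "small" and "large" variants below (0.1 and 0.9) are illustrative assumptions:

```python
import random

def init_population(n_agents, n_features, strategy="random"):
    """Common swarm initialization strategies for binary FS agents."""
    def agent(p_select):
        return [1 if random.random() < p_select else 0
                for _ in range(n_features)]
    if strategy == "full":    # backward selection: all features selected
        return [[1] * n_features for _ in range(n_agents)]
    if strategy == "empty":   # forward selection: no feature selected
        return [[0] * n_features for _ in range(n_agents)]
    if strategy == "small":   # few features, mimics forward selection
        return [agent(0.1) for _ in range(n_agents)]
    if strategy == "large":   # many features, mimics backward selection
        return [agent(0.9) for _ in range(n_agents)]
    return [agent(0.5) for _ in range(n_agents)]  # random (most common)
```

MRMR-style or graph-based (ACO) initialization would replace the uniform random draws with informed scores, but the interface stays the same: the result is always a list of agents.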

Defined Stopping Condition
The stopping, or termination, condition is a parameter defined to stop the execution of an algorithm. In most SI algorithms, the stopping condition is either a single condition or is composed of two separate conditions.
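A composed stopping criterion typically combines an iteration budget with a convergence test. The sketch below uses "no improvement for `patience` iterations" as the second condition, which is one common but assumed choice:

```python
def make_stop_condition(max_iter=100, patience=15):
    """Composed stop condition: stop on an iteration budget OR on
    stagnation of the best fitness value."""
    state = {"iter": 0, "best": float("-inf"), "stale": 0}
    def should_stop(current_best_fitness):
        state["iter"] += 1
        if current_best_fitness > state["best"]:
            state["best"], state["stale"] = current_best_fitness, 0
        else:
            state["stale"] += 1
        return state["iter"] >= max_iter or state["stale"] >= patience
    return should_stop

stop = make_stop_condition(max_iter=100, patience=3)
print(stop(0.5), stop(0.5), stop(0.5), stop(0.5))  # False False False True
```

Dropping either condition reduces this to the single-criterion case mentioned above.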

Fitness Function
Each solution X_i = [x_i1, x_i2, ..., x_id] (where d is the dimension of the solutions and i = 1, 2, ..., n, with n the number of solutions) in the used SI technique must be evaluated by some fitness function f(X_i). Based on the algorithms presented in Section 2, the solution mainly comprises the search agent. For example, in PSO, the search agent is a particle; in GSO, a glow-worm; and in RIO, a cockroach. The difference is, for example, in the ABC algorithm, where the solutions represent the food sources.
The fitness function is often expressed as classification accuracy. Many researchers use SVMs as the classifier to evaluate solutions (e.g., [33,38,50-52]). Researchers in [46] proposed a fitness function that is the average of the classification error rates obtained by three SVM variants, υ-SVM, C-SVM, and LS-SVM, in the DA algorithm. The averaged results were used to prevent the generation of a feature subset based on a single classifier. In [53], the C-SVM classifier using the Radial Basis Function (RBF) kernel was applied in the FA algorithm.
Classification performance is also frequently evaluated with KNN. In [36], the authors first evaluated each ant in ACO through KNN, and then used the Mean Square Error (MSE) for performance evaluation. For the same algorithm, i.e., ACO, the researchers in [35] used KNN (K = 1). The classification performance was also evaluated with KNN (K = 5) in the binary ABC algorithm [54].
In [44], researchers proposed a fitness function where each solution is measured with the Overall Classification Accuracy (OCA), in which the 1-Nearest Neighbor (1NN), K-Nearest Neighbor (KNN), or Weighted K-Nearest Neighbor (WKNN) is called. The value of K is changed dynamically based on the iterations of the algorithm. With this approach, the researchers enabled the diversity of the solutions in each iteration.
A composed weighted fitness function, comprising an error rate and the number of selected features, is also used in the GWO algorithm [40]. Some other approaches comprise the classification accuracy of the Optimum-Path Forest (OPF) in the BA algorithm [55], the Root Mean Squared Error (RMSE) in FA [56], the Greedy Randomized Adaptive Search Procedure (GRASP) algorithm for clustering in ABC [41], and the Back Propagation Neural Network (BPNN) in RIO [49] and ACO [57].
A composed fitness function based on rough sets and classification quality is proposed in [58]. Packianather and Kapoor [48] proposed a fitness function (Equation (1)) which is composed of the Euclidean Distance (ED) used by the Minimum Distance Classifier (MDC) in the classifying process and some other parameters, i.e., weighting factors k_1 and k_2, the number of selected features n_s, and the total number of features n_t.
Rough set reduct was also used in one of the first ACO algorithms for FS [59].
The authors in [47] proposed a composite fitness function that uses the classification accuracy and the number of selected features nbFeat (Equation (2)). The classification accuracy was obtained with the SVM and NB classifiers, respectively.
In [43], the objective function is composed of the classification accuracy rate of a KNN classifier (K = 7) and a measure of class separability. Based on that, five variants were proposed, of which two are weighted sums. The first three use classification accuracy (f1), the Hausdorff distance (f2), and the Jeffries-Matusita (JM) distance (f3), respectively; the fourth is the weighted sum of f1 and f2; and the last one is the weighted sum of f1 and f3.
A weighted fitness function is also used in [45]. The first part comprises the classification accuracy obtained by a KNN classifier (K = 1), and the second part the number of selected features (Equation (3)), where n is the number of selected features, N is the number of all features, w_1 is the weight for classification accuracy, and w_2 is the weight for the selected features.
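This weighted form is commonly written as fitness = w_1 · accuracy + w_2 · (N − n)/N; the sketch below uses that common formulation as an assumption rather than a verbatim reproduction of Equation (3), with illustrative weights w_1 = 0.9 and w_2 = 0.1:

```python
def weighted_fitness(accuracy, n_selected, n_total, w1=0.9, w2=0.1):
    """Weighted fitness: reward classification accuracy, penalize
    large feature subsets (w1 + w2 = 1 is a typical assumption)."""
    return w1 * accuracy + w2 * (n_total - n_selected) / n_total

# A slightly less accurate model that uses far fewer features can win:
print(weighted_fitness(0.95, 40, 50))  # ≈ 0.875
print(weighted_fitness(0.93, 5, 50))   # ≈ 0.927
```

The weights let the researcher trade predictive performance against subset size, which is exactly the tension all the composed fitness functions above are designed to balance.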
A similar approach is used in [34], where researchers proposed two variations of GWO; in both, the goal of the fitness function is to maximize the classification accuracy and lower the number of selected features. The KNN classifier (K = 5) has also been applied here.

Update and Move Agents
In the ABC algorithm, the neighborhood exploration is done by a small perturbation using Equation (4), where v_ij is the candidate food source, parameter x_ij represents the food source, indices j and k are random variables, and i is the index of the current food source. Parameter Φ_ij is a real number in the interval (−1, 1). In the proposed algorithm for FS based on ABC [38], the Modification Rate (MR), also called the perturbation frequency, is used to select features with Equation (5): if a randomly generated value between 0 and 1 is lower than the MR, the feature is added to the feature subset; otherwise, it is not. In [41], the update is done using the same equation as in the originally proposed ABC algorithm. Therefore, the sigmoid function is used (Equation (9)) for the transformation of values.
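The neighborhood step v_ij = x_ij + Φ_ij(x_ij − x_kj) and the MR-based selection rule can be sketched as follows; applying the MR independently per feature is an assumption of this sketch:

```python
import random

def abc_neighbor(food, other, j):
    """Eq. (4)-style step: perturb dimension j of a food source relative
    to another randomly chosen source, with phi drawn from (-1, 1)."""
    phi = random.uniform(-1.0, 1.0)
    candidate = food[:]
    candidate[j] = food[j] + phi * (food[j] - other[j])
    return candidate

def mr_select(n_features, mr=0.3):
    """Eq. (5)-style selection: feature j enters the subset when a
    random draw in [0, 1) falls below the Modification Rate."""
    return [j for j in range(n_features) if random.random() < mr]
```

A small MR therefore yields sparse feature subsets, while MR close to 1 approaches backward selection.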
In [46], researchers proposed a two-sided DA algorithm in which the update of the dragonflies' positions is done using Equation (6). The components of Equation (6) are, in the following order: separation, alignment, cohesion, attraction, and distraction of the ith dragonfly, all summed with the inertia weight and the step. Discretization of the variables' values is done using Equation (7), in which T(∆X_{i,t+1}) sets the step for the variables based on Equation (8). The researchers in [42] used a sigmoid function sig_ij for the creation of the probability vector in the FA algorithm. The final position update is then done with the sigmoid function (Equation (9)), where rand is defined in [0, 1].
Instead of a sigmoid function, researchers in [58] proposed the tanh function to improve the performance of an FA algorithm. The tanh function is responsible for scaling values to binary ones based on Equation (10), in which the parameters are i = 1, ..., n (n = number of fireflies) and k = 1, ..., m (m = number of dimensions).
The difference between the sigmoid and tanh functions is presented in Figure 6. The movement of fireflies in the FA algorithm [53] is done using Equation (11), in which X_i(t) and X_j(t) determine the positions of the less bright and brighter fireflies, respectively. Parameter β comprises the attractiveness component. Discretization is done using the sigmoid function.
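The two transfer functions used for binarization can be compared directly; the "select when the transfer value beats a random draw" rule is the usual convention and is assumed here:

```python
import math, random

def sigmoid(x):
    """S-shaped transfer function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_transfer(x):
    """|tanh(x)| also maps any real value into [0, 1)."""
    return abs(math.tanh(x))

def binarize(x, transfer=sigmoid):
    """Select the feature when the transfer value beats a random draw."""
    return 1 if random.random() < transfer(x) else 0

# tanh saturates faster than the sigmoid for the same input magnitude:
print(sigmoid(2.0))        # ≈ 0.881
print(tanh_transfer(2.0))  # ≈ 0.964
```

The faster saturation of tanh is one intuition behind the performance gain reported in [58]: large velocities commit to a 0/1 decision sooner.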
In [50], the change of the glow-worms' positions is done using a step size calculated for each feature. The primary goal of the glow-worms is to minimize the distance between features. The changes of chosen features are done in the GSO algorithm with Equation (12), where orig_dim and new_dim are the original and new dimensions, respectively, and rand = (0, 1).
Updating the roach location in the RIO algorithm is a composite of three parts [49]. The first part of Equation (13) is the velocity component w ⊕ MT(x_i), updated with a mutation operator. The cognition part C_o ⊕ CR(w ⊕ MT(x_i), p_i) comprises the individual thinking of a cockroach, and is updated with a crossover or two-point crossover operator. The last, social component comprises collaboration among the swarm, and is updated with a crossover operator. The sigmoid function is used for the transformation in continuous space, and is then compared to a randomly generated value in (0, 1) in BBMO [44].
The movement of bats is affected by velocity and position [55]. To change the velocity in BA, the frequency must be updated first (Equation (14)). The velocity from the previous iteration, combined with the global best solution found so far for decision variable j, the current value of the decision variable, and the modified frequency, gives the new velocity, as shown in Equation (14). The position change is finally done based on the sigmoid function. The researchers in [34] proposed two variations of GWO. In the first one, the update is done based on a calculated crossover between solutions (Equation (15)).
The parameters of the crossover are obtained based on the wolf movement towards the three best-known solutions, i.e., alpha, beta, and delta wolves.The second approach is a binary position vector changed by the Sigmoid function.
Researchers in [43] also used the GWO algorithm and updated the position according to Equation (16), in which X_1 to X_3 are the position vectors of the best-known solutions/wolves. The new position is then discretized with Equation (17). To sum up, the analyzed work shows that progress has been made in the FS area. There is a noticeably increasing number of proposed SI algorithms, as well as mixed algorithms, to solve FS problems. In addition, increasingly many novel fitness functions are being proposed. For the already established algorithms, an increased focus on determining the optimum parameter settings has been seen.
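The GWO-style position update followed by sigmoid discretization can be sketched as below. Averaging X_1..X_3 and binarizing against a random draw follow the usual GWO convention; they are assumptions of this sketch rather than verbatim reproductions of Equations (16) and (17), and the steepness factor 10 is arbitrary:

```python
import math, random

def gwo_binary_update(x1, x2, x3):
    """Move toward the alpha/beta/delta wolves (Eq. (16)-style average),
    then discretize each dimension with a sigmoid (Eq. (17)-style)."""
    new_position = []
    for a, b, c in zip(x1, x2, x3):
        continuous = (a + b + c) / 3.0
        # Steep sigmoid centered at 0.5 maps the average to a selection
        # probability; a random draw turns it into a 0/1 bit.
        prob = 1.0 / (1.0 + math.exp(-10.0 * (continuous - 0.5)))
        new_position.append(1 if random.random() < prob else 0)
    return new_position
```

Dimensions on which all three leading wolves agree are almost always copied, while contested dimensions remain stochastic, which preserves exploration late in the run.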

Applications of Feature Selection with Swarm Intelligence
All proposed algorithms must be evaluated either with benchmark datasets or by being applied to real-life problems. Thus, the following section is organized into two subsections. Section 6.1 comprises research articles that only use available datasets for testing the algorithms, while Section 6.2 comprises research articles that apply the proposed algorithms to a particular research field.
Artificially generated benchmarks, e.g., Griewank, Rosenbrock, and Ackley, are not addressed in a separate subsection because there are not many works that use this approach. Research articles in which benchmark functions are subjected to testing are, for example, [60-62].
The remaining group is composed of mixed datasets. In [42], researchers used eleven large datasets and one small one. Datasets with from 73 up to 5000 features are used in [34]. Some other researchers in this group are, for example, [28,54,55,67,68].
In Table 2, the research articles are collected and ordered based on the used datasets' dimensions and the taxonomy. We left out Group Hunting because we were unable to find any work that included the HS algorithm for FS.


Application Areas
There are a variety of application areas in which the presented algorithms are used for FS. Based on that, we defined the following three main groups of application domains into which the algorithms fall (Figure 8):
1. Biomedical Engineering
2. Electrical and Electronic Engineering
3. Computer Engineering

The application field organizes all the presented research articles, which are then classified based on the used algorithms in Table 3. For the same reason as in Table 2, we excluded the Group Hunting section.
Some studies [45,97] use an improved SFLA algorithm on high-dimensional biomedical data for disease diagnosis.
Image-based FS is addressed in [99]. The researchers proposed a method for classifying lung Computed Tomography (CT) images, in which the KH algorithm was used alongside multilayer perceptron neural networks.
With binary FSS, the researchers in [95] proposed an application which predicts the readmission of Intensive Care Unit (ICU) patients within a time window of 24-72 h after being discharged.
The IGWO-KELM methodology for medical diagnosis was proposed in [92]. An Improved GWO algorithm (IGWO) is used to select the most informative subset of features, which is then used in the second stage (KELM), where the prediction is made based on the features selected in the previous stage.
For predicting metabolic syndrome disease, the adaptive binary PSO learnable Bayesian classifier is proposed in [90].
Research [49] addresses the problem of selecting textural features for determining the water content of cultured Sunagoke moss. The proposed approach comprises the RIO algorithm, in which the feature subset is evaluated with BPNN. A machine vision-based system is also used in addition to the latter in [86]. To improve lip reading recognition, BPSO is used with the Euclidean Distances of subsets as the fitness function. Researchers in [48] proposed the BA algorithm for improvement in wood defect classification.
The following studies have been done in the field of Electrical and Electronic Engineering. The authors in [56] used the FA algorithm in near-infrared (NIR) spectroscopy to find the most informative wavelengths. In the subarea of sensors, the researchers in [100] proposed a novel KH algorithm to improve gas recognition by electronic noses. Hyperspectral images are addressed in FS in [43,87]. The first paper uses the GWO algorithm for band selection. For hyperspectral image classification, the researchers in the second article proposed a PSO-based automatic relevance determination and feature selection system.
In the Computer Engineering field, intrusion detection is one of the most intriguing problems for researchers. It is addressed in [96] with the AFSA algorithm, in [47] with the BA* algorithm, and in [94] with the GWO algorithm. Another dynamic subfield is image steganalysis. The researchers in [33] presented a novel method named IFAB, based on an ABC algorithm, for feature selection, and the authors in [53] used the FA algorithm for blind image steganalysis. Other problems tackled in this field include fault diagnosis of complex structures with the BFO algorithm [64], skin-detection-based background removal for enhanced face recognition with adaptive BPSO [89], identifying malicious web domains with BPSO [88], and web page classification with ACO [83].

Discussion
Within this survey alone, we analyzed 22 different original SI algorithms for FS and, altogether, 64 of their variants. However, for the vast majority of all applications, only a few of the algorithms are used. The most frequently used algorithm, PSO, appears in almost 47% of all cases, the four most frequently used algorithms cover more than three quarters of cases (79%), and the top seven algorithms cover 90% of all cases (see Table 4). On this basis, a reasonable doubt arises as to whether one should even pursue the quest of analyzing the broad spectrum of existing SI algorithms for FS. However, the fact is that some of the latest algorithms, e.g., BCO, CS, FA, and GWO, are also used in combination with other techniques and present very promising results in FS. Some of the novel algorithms still provide enough improvements to be considered useful and to produce an impact. Some of the selected algorithms and their components are sorted by the proposed taxonomy in Table 5.
Thus, the goal of this paper is to support researchers in choosing the most adequate algorithm and its settings for their data mining tasks and endeavors. For this purpose, alongside the characteristics of the abundance of SI algorithms for FS, it helps to be aware of the various practical implications of these algorithms and their use. As some researchers report different modifications, adaptations, and hybridizations of the basic algorithms that have shown promising practical results, we collected some of them.
The introduction of different local search mechanisms is used to improve specific aspects of SI algorithms. For example, in [101], the Lévy flight is included to avoid premature convergence of the Shuffled Frog Leaping Algorithm (SFLA) for cancer classification. In [102], on the other hand, the authors proposed to combine tabu search (TS) and binary particle swarm optimization (BPSO) for feature selection, where BPSO acts as a local optimizer each time the TS has been run for a single generation.
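To illustrate the kind of Lévy-flight perturbation mentioned above, the sketch below draws a heavy-tailed step via Mantegna's algorithm, a common way of sampling Lévy-distributed steps. This is a minimal, generic sketch, not the exact mechanism of [101]; the step scale `0.01` and dimensionality are arbitrary illustrative choices:

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(dim, beta=1.5, rng=None):
    """Draw one Levy-distributed step via Mantegna's algorithm.

    The heavy tail produces occasional long jumps, which is what
    helps a stagnating swarm escape premature convergence."""
    rng = rng or np.random.default_rng()
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

# Perturb a stagnating position: mostly small moves, rarely a long jump.
position = np.zeros(10)
new_position = position + 0.01 * levy_step(10)
```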
In some cases, researchers combine specific characteristics of different SI algorithms to improve the overall efficiency. For example, in [103], the authors proposed a combination of the Monkey Algorithm (MA) with the Krill Herd Algorithm (KH) in order to balance exploration and exploitation adaptively and find the optimal solution quickly. A similar case is presented in [81], where the diagnosis of childhood Atypical Teratoid/Rhabdoid Tumor (AT/RT) in Magnetic Resonance brain images and Hemochromatosis in Computed Tomography (CT) liver images was improved through the hybridization of the Particle Swarm Optimization and Firefly (PSO-FF) algorithms.
Normally, in a typical machine learning application, FS is used to optimize at least two major objectives: predictive performance and the number of selected features. To solve this multi-objective optimization problem, some researchers used several co-evolving swarms, so-called multi-swarms. For example, in [104], the authors developed a formulation utilizing two particle swarms in order to optimize a desired performance criterion and the number of selected features simultaneously, thereby reducing the dimensionality of spectral image data.
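The two objectives are often also handled by a single scalarized fitness function rather than by multiple swarms. The sketch below shows one widely used weighted-sum form; the weight `alpha=0.9` is an illustrative assumption, not a value taken from [104]:

```python
def fs_fitness(error_rate, n_selected, n_total, alpha=0.9):
    """Weighted-sum scalarization of the two common FS objectives:
    classification error and the fraction of selected features.
    Lower is better; alpha trades accuracy against subset size."""
    return alpha * error_rate + (1 - alpha) * (n_selected / n_total)

# Two subsets with equal error: the smaller one gets the better fitness.
small = fs_fitness(0.10, 5, 100)   # 0.9*0.10 + 0.1*0.05 = 0.095
large = fs_fitness(0.10, 50, 100)  # 0.9*0.10 + 0.1*0.50 = 0.140
```

A Pareto-based multi-swarm approach avoids fixing `alpha` in advance, at the cost of having to manage a set of non-dominated solutions.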
In [105], the authors proposed an SI FS algorithm based on the initialization and update of only a subset of particles in the swarm, as they observed that feature selection failed to select small feature subsets when discovering biomarkers in microarray data. Their approach successfully increased the classification accuracy and decreased the number of selected features compared to other SI methods.
There are cases where the nature of the application requires very fast FS algorithms, such as mining data streams on the fly. An example of how to improve the speed of an SI-based FS algorithm is presented in [77]. The authors proposed an accelerated PSO swarm search FS by simplifying some of the formulas for updating particles and applying the swarm search approach in an incremental manner, obtaining a higher gain in accuracy per second spent in pre-processing.
The authors in [13] pointed out that PSO has similar advantages to genetic algorithms in terms of a straightforward representation. However, neither of them can be used for feature construction. Moreover, scalability might be a bottleneck for both, especially for problems with thousands of features or more [75]. Our study supports this claim too: nowadays, in fields such as genetics and bioinformatics, the number of features may exceed a million. For such large dimensions, very specific algorithms as well as hardware are needed. However, our study has not revealed any special solutions that apply SI-based feature selection at such dimensions; by the same token, no special datasets were found during our literature review. Additionally, a characteristic of most SI algorithms is that they move individuals towards the best solution [13]. This helps them achieve better and potentially faster convergence, which, however, is not always towards the optimum. Even though PSO is a good and fast search algorithm, it can converge prematurely, especially on complex search problems. Many variants were proposed to deal with this problem [31]. For example, an enhanced version of binary particle swarm optimization, designed to cope with the premature convergence of the BPSO algorithm, is proposed in [30]. On the other hand, the lack of explicit crossover can lead to entrapment in local optima. Some methods, for example product graphs, which are similar to arithmetic crossover in evolutionary algorithms, have been proposed [106] to cope with this problem.
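For readers unfamiliar with the binary PSO mentioned above, the sketch below shows one standard update step (real-valued velocity, sigmoid transfer function, stochastic bit flips). It is a generic textbook-style sketch, not the enhanced variant of [30]; the parameter values (`w`, `c1`, `c2`, `vmax`) are common illustrative defaults. Clamping the velocity keeps the flip probabilities away from 0 and 1, which is one simple way to slow down premature convergence:

```python
import numpy as np

def bpso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, vmax=4.0, rng=None):
    """One BPSO update for a binary feature mask x (1 = feature selected).

    Velocities are updated as in real-valued PSO, clamped to [-vmax, vmax],
    mapped through a sigmoid to flip probabilities, and each bit is then
    resampled stochastically."""
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    v = np.clip(v, -vmax, vmax)               # keeps probabilities off 0/1
    prob = 1.0 / (1.0 + np.exp(-v))           # sigmoid transfer function
    x_new = (rng.random(x.shape) < prob).astype(int)
    return x_new, v
```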
According to [13], the particle swarm optimization algorithm is easier to implement than genetic algorithms or genetic programming. However, we cannot confirm that all SI algorithms are easy to implement.

Issues and Open Questions
Our systematic review has revealed many issues, while some open questions that should be explored in the coming years have also appeared. Although the research area is very vibrant, there is still room for improvement. Regarding the main issues, some authors do not pay special attention to the description of their algorithms and experiments. For that reason, many studies are not reproducible. It would be very important for the future that authors follow standardized guidelines for presenting algorithms and experimental settings in scientific papers [107]. On the other hand, many authors used the number of iterations (n_ITER) as the termination criterion instead of the number of function evaluations (n_FES). Because of that, many discrepancies appear when comparing various algorithms.
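Terminating by n_FES instead of n_ITER is easy to retrofit: wrap the fitness function in a counter that enforces a fixed evaluation budget, so two algorithms with different population sizes or extra local-search evaluations still consume exactly the same budget. A minimal sketch (the class name and error type are our own illustrative choices):

```python
class EvalBudget:
    """Wrap a fitness function and terminate by function evaluations
    (n_FES), which is comparable across algorithms with different
    population sizes, unlike an iteration/generation count."""

    def __init__(self, fitness, max_fes):
        self.fitness = fitness
        self.max_fes = max_fes
        self.used = 0

    def __call__(self, solution):
        if self.used >= self.max_fes:
            raise RuntimeError("evaluation budget exhausted")
        self.used += 1
        return self.fitness(solution)

# Every call through the wrapper counts against the shared budget.
budget = EvalBudget(lambda subset: sum(subset), max_fes=10000)
```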
In line with this, papers should also rigorously report how parameter tuning was conducted. As we can see from the papers reviewed in our survey, many studies do not report how they found proper parameter settings, although these settings have a substantial influence on the quality of the results.
It would also be important to encourage more theoretical studies on the various representations of individuals. It is still unknown which representation, binary or real-valued, is better for a particular family of problems. On the other hand, the scalability of SI algorithms also seems to be a major bottleneck, especially for problems with thousands of features or more; studying the effect of parallelization might therefore be a fruitful direction for future research. Finally, developing new visualization techniques for presenting results is also a promising direction for the future.
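To make the representation question concrete: real-valued SI algorithms are usually applied to FS by mapping each continuous position to a feature subset, most simply by thresholding. The sketch below shows this common mapping (the threshold 0.5 and the non-empty-subset guard are illustrative conventions, not taken from any specific surveyed paper):

```python
import numpy as np

def real_to_subset(position, threshold=0.5):
    """Map a real-valued position in [0,1]^n to a feature subset:
    feature i is selected when position[i] > threshold.

    A binary representation encodes the mask directly instead, which
    changes the search operators the algorithm can use."""
    mask = np.asarray(position) > threshold
    if not mask.any():                      # guard: never return an empty subset
        mask[np.argmax(position)] = True
    return np.flatnonzero(mask)

selected = real_to_subset([0.9, 0.2, 0.7, 0.1])  # features 0 and 2
```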

Guidelines for SI Researchers
Based on the topics presented in the previous sections, we can provide some practical guidelines for researchers who would like to conduct research in this field.
If an author wants to propose a new SI algorithm, it should follow the defined steps of the SI framework (see Section 2). We would like to highlight that the future of this field lies not in proposing new algorithms, but in improving existing ones. Sörensen [108] pointed out that instead of introducing novel algorithms, which do not always contribute to the field of meta-heuristics, the existing ones should be used in combination with other techniques. In this manner, some of the latest algorithms, e.g., BCO, FA, and GWO, are used in combination with other techniques and present promising results in FS [39,40,42,53,64].
Optimization ideas for algorithms are collected in Section 5. We provide a broad overview of the different approaches that have been used in the individual stages, i.e., initialization, stopping condition, fitness function, and agent update and movement. Researchers can not only change a specific phase of their algorithm, but can also combine algorithms; thus, a hybrid version of the algorithms can be proposed.
Each algorithm must also be tested. In this stage, there are three possibilities, as presented in Section 6: researchers can test their algorithms on benchmarks, on available datasets, or in real-life applications. Algorithms that have been tested on available datasets are collected in Section 6.1 and are best seen in Table 2. The latter is divided not only by the size of the datasets, i.e., the number of features per instance, but also by the proposed taxonomy (Figure 2). In addition, algorithms that have been tested on real-life applications are similarly displayed in Table 3. Therefore, authors can easily find relevant articles based on their needs.
Researchers also face problems regarding the implementation of algorithms, as stated in Section 7.1. For such problems, existing libraries can be very useful. One of these is NiaPy, a microframework for building nature-inspired algorithms [109]. NiaPy is intended for simple and quick use, without spending time implementing algorithms from scratch.
A very important aspect also concerns the design and conduct of experiments. A large percentage of papers reported that comparisons of algorithms were done according to the number of iterations/generations. This should be avoided in the future, because the number of iterations/generations does not ensure a fair comparison.
Since many SI algorithms have not been applied to the FS problem, the possibilities for further research in this field are vast.

Conclusions
This paper presents a comprehensive survey of SI algorithms for FS, which reduces the gap in this research field. We first analyzed 64 SI algorithms systematically and sorted them according to the proposed taxonomy. After that, SI for FS was analyzed on the basis of the SI framework.
Different approaches are described for each framework phase of the SI algorithms. Moreover, to show the usage of FS in different application areas, we divided the presented papers into those in which researchers only used datasets for testing their methods and those in which they applied them to practical problems in the field.
Based on the analyzed data and the many papers comprising SI for FS, as shown in Figure 2, it can be seen that interest in this field is increasing, because we live in the era of Big Data, which is challenging not only for FS, but for all machine learning techniques. Therefore, the growing tendency of many researchers is to select only the most informative features out of whole sets of several thousand or even tens of thousands of features. Such high dimensionality of data is most evident in Medicine, in the DNA microarray datasets. Another reason for the increased research effort in this field is the boom of SI algorithms in recent years, which were later applied to the FS problem.
We think that the future of the SI field, not limited only to SI for FS, includes not only developing new algorithms based on the behavior of some animals, but also modifications of existing ones. The latter is expected to be done by combining different parts of algorithms together. In any case, we strongly believe that the use of SI for FS will continue to grow and will provide effective and efficient solutions.
Let this review serve as a foundation for further research in this field. Researchers can focus on comparing different SI algorithms and their variants, and conduct in-depth empirical and analytical research. It would also be interesting to look at the computational performance and time complexity of such algorithms and compare their efficiency.
Because all of the novel algorithms must be tested, a systematization of datasets is recommended. As a basis, our division of datasets presented in Section 6.1 can be taken into account. We must also think in the direction of a unified measurement approach, which could improve the comparison of results. A more systematic approach to evaluating the performance, advantages, and drawbacks of specific algorithms would greatly improve the decision-making process of a data miner facing the problem of feature selection.
It would also be fruitful to compare whether some algorithms perform better in specific cases, e.g., on specific datasets or with a small/large number of features, and whether hybridization, e.g., the introduction of some specific local search, affects all SI algorithms or just some of them. There is no obvious answer to the question of which SI algorithm is the most effective for FS, for the following reasons. Firstly, the No Free Lunch Theorem [110] tells us that a general-purpose, universal optimization strategy is impossible [111]. Furthermore, not all SI algorithms, as presented in Table 1, have been applied to FS problems, and most of those that have lack thorough empirical evaluations. Experimental results obtained on different real-world problems are even harder to compare. For that reason, a systematic and exhaustive empirical investigation should be conducted in the future on how different algorithms, especially those which have shown the most promising results thus far, e.g., PSO, ACO, FA, GWO, DA, and GSO, and some promising newcomers, perform under varying conditions (on different datasets, for very big datasets, etc.). In this manner, the implementation issues and the impact of different configurations should also be addressed.
Nowadays, the majority of papers that address the feature selection problem use instances with up to several thousand features. From the papers, it is evident that researchers manage to adapt algorithms to work at such dimensions, but the question that arises is: What happens when we scale the algorithms to problems with millions of features (such as in genetics and bioinformatics)? Since there is a big scalability gap in this field, this problem is another that needs to be addressed in the future.

Figure 2 .
Figure 2. Swarm intelligence algorithms. Asterisks (*) are added because the same abbreviations are used for different algorithms (see Table A1 in Appendix A).

Figure 5 .
Figure 5. A general framework for wrapper models of feature selection.

Figure 7 .
Figure 7. Analysis of used datasets in swarm intelligence (SI).

Figure 8 .
Figure 8. Analysis of applications in SI.

Table 1 .
Hits for "swarm intelligence" AND "feature selection" in online academic databases from 2007 to 2017.

Table 2 .
Used dataset sizes in evaluating swarm intelligence (SI) for feature selection (FS).

Table 3 .
Application areas of SI for FS.

Table 4 .
Impact (share of citations) of SI algorithms for FS.

Table 5 .
Summary of selected algorithms and their components sorted by proposed taxonomy.