A Review of the Modification Strategies of the Nature Inspired Algorithms for Feature Selection Problem

This survey aims to provide a research repository and a useful reference for researchers planning to develop new Nature-inspired Algorithms tailored to solve Feature Selection problems (NIAs-FS). We identified and thoroughly reviewed the literature in three main research streams: the feature selection problem; optimization algorithms, particularly meta-heuristic algorithms; and modifications applied to NIAs to tackle the FS problem. We provide a detailed overview of 156 articles about NIA modifications for tackling FS. We support our discussion with analytical views, visualized statistics, applied examples, and open-source software systems, and we discuss open issues related to FS and NIAs. Finally, the survey summarizes the main foundations of NIAs-FS, with approximately 34 different operators investigated. The most popular operator is chaotic maps. Hybridization is the most widely used modification technique. Three types of hybridization are distinguished, among them integrating an NIA with another NIA and integrating an NIA with a classifier. The most widely used hybridization is the one that integrates a classifier with the NIA. Microarray and medical applications are the dominant application areas in which most NIAs-FS are modified and used. Despite the popularity of NIAs-FS, there are still many areas that need further investigation.


Introduction
As data accumulate rapidly in databases and data warehouses, the dimensionality problem becomes the main challenge for machine learning tasks (e.g., classification or clustering) [1]. Many negative effects may result from scaling up the dimensionality of a data set. These include the existence of irrelevant and redundant features that may adversely affect the learning algorithm or cause overfitting [2]. Thus, the development of effective data mining techniques has become an urgent necessity in various fields such as medicine [3], bioinformatics [4], text mining [5], image processing [6], design of smart infrastructures and smart homes [7], financial estimation [8,9], coastal engineering [10], and sustainability [11]. Their significance depends on their ability to turn huge amounts of data into an acceptable form. This simplifies knowledge discovery and makes huge data sets more understandable, analyzable, and predictable.
Feature Selection (FS) is a pre-processing data mining technique for dimensionality reduction [12]. In recent years, research in FS has been rapidly developed in line with the hybridization [33], updated mechanism, new initialization strategy, new fitness function, new encoding schemes, modified population structure, multi-objectives, state flipping [34,35], and parallelism [36]. Each modification addresses the weakness of the NIA algorithm in some issues without harming the essence of the algorithm and its logic. The research field of NIA-FS has witnessed considerable development. To show the expansion of the NIAs-FS models in the literature, Figure 1 illustrates the correspondence between the year and number of publications that combine modified NIAs with FS. In the first years, research was volatile, and there were also years of research disruptions. Since 2006, the number of publications has remarkably increased to reach its peak in 2018. Furthermore, the research in this area has become very effective in the last five years. An intensive search for surveys in this area found that there are very limited NIAs-FS surveys [20]. Some FS surveys did not refer to meta-heuristics at all, but focused on other issues such as data perspectives [19], supervised/unsupervised FS approaches [15], and other FS surveys were tailored to specific applications or limited to certain domains [37]. The analysis of FS surveys showed that either they briefly refer to the meta-heuristic FS or they do not refer to them at all. To our knowledge, there is no survey about a modified NIAs-FS. This finding was one of the main motivations for this work. Unlike the previous FS surveys, FS will not be discussed in isolation from other related issues. The main objective is to bridge the gap in FS surveys by providing a review of the important aspects and design issues of NIAs-based FS approaches. 
The main modification strategies that have been adopted to enhance NIA for solving FS problem are categorized and discussed.

1. What is the current status of modified NIAs-FS research?
2. What are the important aspects and design issues regarding building NIA for tackling FS?
3. What are the modifications that were applied on NIA for tackling FS and in what domains were they applied?
4. Are there current open-source software systems that apply a modified NIA-FS?
Based on the aforementioned research questions, we have structured this review around three primary issues:
• Theoretical aspects of modified NIAs-FS: detailed coverage of three main subjects, namely meta-heuristic optimization, the FS problem, and modifications that enhance meta-heuristics for FS;
• Applied aspects of modified NIAs-FS: different applications of modified NIAs-FS;
• Technical aspects of modified NIAs-FS: a newly developed FS tool, named Evolopy-FS.
The review refers to various well-regarded publishers such as ACM, Elsevier, Springer, IEEE, World Scientific, Hindawi, and others. Figure 2a shows the number of publications for each NIA in the main publishers regarding modifications for tackling FS. Figure 2b shows the number of citations for popular NIA articles in the main publishers regarding modifications for tackling FS. Section 2 discusses the problem of feature space symmetry in datasets and the need for feature selection as a disentanglement of symmetry. A description of meta-heuristic optimization is presented in Section 3. Section 4 discusses the feature selection problem and its related issues. A review of different NIAs-FS modification techniques is presented in Section 5. Section 6 highlights the main applications of modified NIAs-FS. An assessment of NIA-FS is provided in Section 7. Finally, in Section 8, the outlook for the NIA-FS research field and possible future directions are discussed.

Feature Selection as a Task of Disentangling the Symmetry of Feature Space
The aim of supervised machine learning is to estimate a function f that fits the features of the training data well and allows outputs to be predicted for previously unseen inputs. The number of samples required for training grows exponentially with the dimension of the feature space, which is known as the "curse of dimensionality" [38]. To approximate a Lipschitz-continuous function composed of Gaussian kernels placed in the quadrants of a d-dimensional unit hypercube with error $\varepsilon$, one requires $\mathcal{O}(1/\varepsilon^{d})$ samples [39].
Intuitively, a symmetry of an object is a transformation that leaves certain properties of the object invariant. For example, translation and rotation are symmetries of objects, which do not change their representations [40]. The geometric structure of the feature space imposes structure on the class of functions f that we are trying to learn. One can have invariant functions that are unaffected by the action of the group, i.e., f(ρ(g)x) = f(x) for any g ∈ G and x, where G is the symmetry group, g is the symmetrical transformation in the feature space, ρ(g) is the group representation, and x is an input in the space of input signals G(Ω) that acts as a point in the feature space. Such symmetrical transformations (e.g., translation, rotation, shifting) are commonly used for data (image) augmentation to increase the number of data instances for effective training of machine learning models.
The goal of feature selection is to eliminate uninformative and/or redundant features from the feature space, leaving only relevant (i.e., predictive) features [41]. For a dataset with N samples and M dimensions (or features), feature selection seeks to decrease M to M′ with M′ ≪ M. In other words, feature selection produces a disentangled representation [40] with respect to a particular decomposition of a feature space with some symmetry group, which may be useful for subsequent tasks, for example by reducing the complexity of training a machine learning classifier. Such disentanglement is, in fact, performed by a neural network as part of the classification process by learning the weights of the network nodes [42], which produce asymmetric activations for the separation of classes.
The redundant features are characterized by a high level of inter-correlation. Such correlated features result in the symmetrical distribution of instances in feature space.
Feature selection aims to reduce feature dimensionality by reducing the symmetry in feature space. The resulting distribution of classes in the lower-dimensional feature space should be as asymmetrical as possible to allow for easy separability of classes [43]. Furthermore, a strong correlation in features might result in numerous near-optimal feature subsets, making traditional feature selection approaches unstable and lowering the trust in selected features [44]. As many different feature space decompositions are possible, the problem of finding an optimal feature subspace in a high-dimensional feature space is known to be NP-hard [45]. In this paper, the nature-inspired meta-heuristic optimization algorithms are studied for solving the feature selection problem.

Meta-Heuristic Optimization
Meta-heuristic algorithms are characterized by flexibility, simplicity, and low computational cost, and they are derivative-free methods. The principle of meta-heuristics is reasonability versus completeness: they give up completeness in order to provide approximate solutions for complex, otherwise unsolved problems. Meta-heuristics are further categorized, based on the number of candidate solutions handled during the optimization process, into trajectory-based and population-based algorithms.

Trajectory-Based Optimization
A trajectory algorithm begins with one random solution and tries to optimize it until a stop condition is satisfied. The computation overhead is reduced significantly because only one solution is improved and evaluated during the optimization process (Equation (1)). Trajectory algorithms are local search techniques. They depend on making a few changes in the components of the current solution to find a better one. A potential solution is picked, and its neighboring solutions are checked to see whether they are better. Local search implies searching within a limited region (exploitation). This process suffers from potential entrapment in local minima because of weak diversity and a lack of information exchange. Examples of trajectory algorithms are Simulated Annealing (SA) [46] and Tabu Search (TS) [47].
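The single-solution loop described above can be sketched as a simple bit-flip hill climber. This is a minimal illustration and not SA or TS: the function `hill_climb`, the toy objective `toy_error`, and the target mask are hypothetical names introduced here, and the toy objective has no local minima, so the sketch shows the control flow and the evaluation count rather than entrapment.

```python
import random

def hill_climb(fitness, n_bits, max_iters=200, seed=0):
    """Single-solution local search: flip one bit, keep the move if it is not worse."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n_bits)]
    current_fit = fitness(current)
    evaluations = 1                      # the initial solution is evaluated once
    for _ in range(max_iters):           # stop condition: iteration budget
        neighbor = current[:]
        i = rng.randrange(n_bits)
        neighbor[i] = 1 - neighbor[i]    # small change to one component
        neighbor_fit = fitness(neighbor)
        evaluations += 1                 # one evaluation per iteration
        if neighbor_fit <= current_fit:  # minimization
            current, current_fit = neighbor, neighbor_fit
    return current, current_fit, evaluations

# Toy objective (assumption): number of bits that differ from a target mask.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]

def toy_error(bits):
    return sum(b != t for b, t in zip(bits, TARGET))

best, best_fit, n_evals = hill_climb(toy_error, len(TARGET))
```

Only one candidate is evaluated per iteration, which is why the cost grows with the number of iterations alone.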

Population-Based Optimization
A population algorithm begins with a set of randomly generated solutions and tries to enhance them during the optimization process. Each candidate solution fluctuates outward or converges toward the best solution following a certain mathematical framework. The predominance of these algorithms is due to their simplicity and flexibility. Simplicity means that they are built upon simple methodologies and evolved from simple concepts. Flexibility means that they can be adopted to deal with real-world problems without structural modifications: all that is required is an accurate representation of the problem, and the structure of the optimizer is left untouched. Population algorithms are more efficient in mitigating local minima than trajectory algorithms because more individuals are involved and more information is shared between them. However, the multiplicity of solutions increases the computational burden because more evaluations are required. The number of calls to the fitness function is driven by the number of individuals and the number of iterations. Equation (2) gives the number of function evaluations in population algorithms, where N is the number of individuals and T is the number of iterations:

$$\text{number of function evaluations} = N \times T \qquad (2)$$

A population algorithm begins with the initialization step, where a set of candidate solutions is generated. A solution is a candidate (possible) solution if it satisfies the constraints of the problem. The next step is the evaluation of individuals. The evaluation is carried out using a specified fitness function and in terms of predefined evaluation criteria. The fitness function is called for each individual so that each individual gets a fitness value. After the individuals are evaluated, the update process refines and improves the current solutions. This requires updating the positions of individuals in the search space. This iterative process of evaluating and updating individuals continues until a predefined criterion is satisfied and the global optimal solution is best approximated.
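The initialize, evaluate, and update cycle can be sketched as follows. The update rule used here (copy bits from the best solution with probability 0.5, then flip one random bit) is an assumption chosen for brevity and does not correspond to any specific NIA; `population_search` and `count_ones` are hypothetical names. The counter confirms that the fitness function is called once per individual per iteration.

```python
import random

def population_search(fitness, n_bits, n_individuals=10, n_iterations=20, seed=0):
    """Generic population loop: initialization, evaluation, update."""
    rng = random.Random(seed)
    # Initialization: a set of random candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(n_individuals)]
    best, best_fit, evaluations = None, float("inf"), 0
    for _ in range(n_iterations):
        # Evaluation: every individual receives a fitness value.
        for individual in population:
            fit = fitness(individual)
            evaluations += 1
            if fit < best_fit:
                best, best_fit = individual[:], fit
        # Update (illustrative rule): drift toward the best, plus one random flip.
        new_population = []
        for individual in population:
            child = [g if rng.random() < 0.5 else b
                     for b, g in zip(individual, best)]
            j = rng.randrange(n_bits)
            child[j] = 1 - child[j]
            new_population.append(child)
        population = new_population
    return best, best_fit, evaluations

def count_ones(bits):
    # Toy objective (assumption): minimize the number of ones.
    return sum(bits)

best, best_fit, n_evals = population_search(count_ones, n_bits=12)
```

With the default settings, `n_evals` equals 10 × 20 = 200, in line with the product of population size and iteration count.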
Population-based algorithms comprise NIAs, which result from the union of nature with different scientific fields including physics, biology, mathematics, and engineering. Computer science utilized these relations between science and nature and turned them into a well-defined discipline for optimizing different challenging problems. NIAs are categorized based on the source of inspiration into EA- and SI-based algorithms [21].

Evolution-Based Optimization (EA)
This category includes different computational systems that share the emulation of biological evolution. EAs model natural cellular processes such as reproduction, mutation, recombination, and selection.
EAs are typically designed by generating a population of possible solutions $\vec{I}_1, \vec{I}_2, \vec{I}_3, \ldots, \vec{I}_{n-1}, \vec{I}_n$, called chromosomes. Each chromosome is split into smaller units called genes. The length of the chromosome (number of genes) determines the dimensionality of a problem. The relation between gene, chromosome, and population can be expressed as gene ⊂ chromosome ⊂ population. Most current evolutionary frameworks implement the chromosome and the population as a 1-d array (vector) and a 2-d array, respectively. Equation (3) identifies the individual $\vec{I}_i$ with a length d and Equation (4) identifies a population P where each individual represents a row in a matrix. Each solution is evaluated by a certain objective function to determine its quality and decide whether it is fit or unfit. The highest evaluated solution (best individual) is preserved at each iteration. The unfit solutions (worst individuals) are candidates to be replaced by newly generated offspring. This allows the average fitness value to increase dramatically throughout the iterations. Common EA examples are GA [22] and DE [23]. GA is undoubtedly the most widespread and typical example of EAs.
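As a rough illustration of this representation, the sketch below stores a population as a list of rows (one chromosome per row, one gene per column), preserves the best individual, and replaces the worst one with an offspring. The objective, the one-point crossover, and the single-bit mutation are assumptions made for the example; `objective` and `next_generation` are hypothetical names.

```python
import random

rng = random.Random(1)
d, n = 6, 4  # genes per chromosome, chromosomes per population

# Population P: n rows (chromosomes), each a vector of d genes.
P = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]

def objective(chromosome):
    # Toy objective (assumption): more ones means a fitter chromosome.
    return sum(chromosome)

def next_generation(population):
    """Keep the best individual, replace the worst with a new offspring."""
    ranked = sorted(population, key=objective, reverse=True)
    elite = ranked[0][:]                      # best individual is preserved
    parent_a, parent_b = ranked[0], ranked[1]
    cut = rng.randrange(1, d)                 # one-point crossover
    child = parent_a[:cut] + parent_b[cut:]
    k = rng.randrange(d)
    child[k] = 1 - child[k]                   # single-gene mutation
    survivors = [row[:] for row in ranked[1:-1]]
    return [elite] + survivors + [child]      # the worst row is dropped

new_P = next_generation(P)
```

Because the elite row is copied unchanged, the best objective value in `new_P` can never be lower than in `P`.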

Swarm-Based Optimization (SI)
SI algorithms share a common behavior that closely resembles the social behavior of creatures. A swarm system comprises a large number of agents that are distributed in the environment to achieve a global target. Intelligence can be seen in the actions that agents take to coexist. The main characteristics of swarm systems are adaptability, self-organization, distributed control, scalability, and flexibility [20]. The most common SI examples are Particle Swarm Optimization (PSO) [24] and Ant Colony Optimization (ACO) [25]. PSO is inspired by flocks of birds that search for food. The search procedure is guided by two main factors: Pbest and Gbest. Pbest represents the best position previously found by the particle itself. Gbest represents the best individual in the whole swarm. Particles also have a position and a velocity that are both updated in each iteration.
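The roles of Pbest, Gbest, position, and velocity can be sketched with the commonly used continuous PSO update. The parameter values (w = 0.7, c1 = c2 = 1.5), the search bounds, and the sphere objective are assumptions for illustration only; `pso` and `sphere` are hypothetical names.

```python
import random

def pso(objective, dim, n_particles=15, n_iterations=50,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal continuous PSO (minimization) with inertia weight w."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # best position of each particle
    pbest_fit = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]     # best position in the swarm
    initial_best = gbest_fit
    for _ in range(n_iterations):
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][j] = (w * vel[i][j]
                             + c1 * r1 * (pbest[i][j] - pos[i][j])
                             + c2 * r2 * (gbest[j] - pos[i][j]))
                pos[i][j] += vel[i][j]               # position follows velocity
            fit = objective(pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit, initial_best

def sphere(x):
    # Toy objective (assumption): sum of squares, minimum at the origin.
    return sum(v * v for v in x)

gbest, gbest_fit, initial_best = pso(sphere, dim=3)
```

The inertia weight `w` scales how much of the previous velocity is retained, which is the parameter that several of the modifications reviewed later adjust.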

Challenges of Meta-Heuristic Optimization
Despite the efficiency of meta-heuristics in tackling challenging optimization problems, some obstacles impact their performance. These include dynamicity, multi-objectivity, constraints, and uncertainty. In multi-objective problems, multiple conflicting objectives are optimized until trade-offs (a Pareto optimal set) are achieved, and the search space is considerably more complex. The optimization problem becomes highly challenging when the number of objectives becomes larger than four [48]. The field of many-objective optimization has emerged to deal with these cases. Constraints of real problems create gaps in the search space by dividing it into feasible and infeasible regions. Feasible regions satisfy the constraints while infeasible regions violate them [49]. Accordingly, the optimization algorithm should follow certain mechanisms to move closer to the promising region and avoid the infeasible region until an optimal solution is found. The other main issue of meta-heuristic optimization is uncertainty. For example, the global solution may frequently change its position in the search space, which requires more attention from the optimization algorithm. Some operators are used for registering the history and memorizing the locations of the global optima over time. Other severe challenges are related to the problem search space, such as the existence of many holes or valleys that lead to stagnation in local minima, discontinuities in the search space, global optima located on the boundary of the search space (the boundary of constraints), and the isolation of global optima [32].
Population algorithms are characterized by two conflicting milestones called exploration (diversification) and exploitation (intensification) [32]. In exploration, the candidate solutions change abruptly, so that more regions are examined and diverse solutions are found. In exploitation, the candidate solutions change gently and less abruptly. GA realizes these processes through crossover and mutation operators. Crossover intermixes a combination of solutions while mutation squeezes certain regions and searches locally. PSO sets the inertia weight to large values for more exploration and to small values for more exploitation.
The main challenges of exploration and exploitation are as follows. Firstly, since they have conflicting purposes, increasing one decreases the other. Secondly, the transition between these two milestones is not well defined because the search spaces of optimization problems are usually unknown. Thirdly, performing pure exploration yields less accuracy in approximating an optimal solution because different regions are explored without a focus on a certain promising region, whereas performing pure exploitation leads to entrapment in local optima. Achieving a balance between exploration and exploitation may produce better results and increases the chance of getting close to the optimal solution. Recently, this idea has become an active research problem. Several studies have tried to attain this balance by integrating random and adaptive operators into the structure of the algorithms.

Feature Selection
This section introduces FS in two parts: The dimensionality problem and the FS system based on the NIA search strategies.

Dimensionality Problem
Due to the incremental growth of information and the abundance of data, data sets have increased in both data samples (number of instances) and dimensions (number of features). As a result of the increased dimensionality, different negative effects have emerged in data mining tasks. One of these problems is called the curse of dimensionality, which describes how data become sparser in a high-dimensional space [12]. This raises the need for more instances to train the classifier, which increases the learning time. Learning algorithms were designed to build their models based on rules inferred from a small number of dimensions, and they cannot generalize well in a high-dimensional space. High dimensionality implies the existence of noisy features, such as redundant and irrelevant features, that mask the informative features, mislead the classifier, and cause overfitting. Overfitting [2] occurs when a classifier is overtrained on the data and learns all examples, including outliers. Treating noise and random fluctuations as meaningful concepts leads to overly complex models; conversely, learning from relevant features allows the classifier to be more accurate. Another negative effect of increasing dimensions is the increased demand for specialized hardware such as large memory storage and high-speed processors, which increases cost.

FS Preliminaries
Features are defined as measurable properties of the observation under study. The complexity of the problem is determined by its features. In real-world applications, the discovery of relevant features is a big challenge. In 1997, the first papers about relevance and feature selection were published [14]. Feature relevance can be formalized as follows. For 1 ≤ i ≤ n, let $E_i$ be the domain of feature $x_i$, and let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of all features. $E = E_1 \times E_2 \times \cdots \times E_n$ is the instance space from which instances derive their values. Each instance can be represented as a point in this space, and the distribution of these data points has a probability P. If we consider the class (label) space to be T, then we can define an objective function c as a relation that maps an instance in E to a specified label/class in the label space T, that is, c: E → T. A data set S with |S| instances is the result of sampling |S| times from E with probability P and obtaining the labels from T. A feature $x_i$ in X is relevant with respect to the class concept if there exist two instances A and B in E that differ only in their assignment to $x_i$ (all their feature values are the same except those for feature $x_i$) and c(A) ≠ c(B). In contrast, a variable with no correlation or only a weak correlation with the target concept is called an irrelevant feature. Other types of noisy features are redundant features. These are features that are highly related to and connected with other features and add nothing new regarding the classification decision.
In the literature, FS has been defined in different ways, which are all close in meaning and intuition [15]. FS is a search process that tries to find the subset of features that best describes the data. In terms of relevance discovery, FS aims to determine the most meaningful subset of features, which has the largest relevance and the minimum redundancy. Although these features are fewer than the original features, they carry the maximum discriminative information. Classically, FS selects a subset of M features from a set of N features, where M < N and the value of an evaluation function is optimized over all subsets of size M. The essence of FS is to select or discard features intelligently in such a way that the resulting class distribution is as close as possible to the class distribution obtained with the complete set of features. In other words, FS is not only a technique for reducing the size of a data set; it should also find a trade-off between different conflicting objectives. As a multi-objective optimization problem, FS has two primary objectives to be optimized: the performance and the number of selected features. These objectives conflict because the optimization algorithm has to achieve the maximum performance with the minimum number of selected features.
Typically, the standard process of FS consists of four primary stages: subset generation, subset evaluation, stopping criterion, and result validation [15].
Regarding subset generation and the search procedure, FS is considered an NP-hard problem. When the number of features equals n, the search space comprises $2^n$ subsets of features. Using brute-force search on such a huge search space requires exponential running time to traverse all the candidate subsets of features.
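The growth of the search space can be checked directly by enumerating every binary mask for a small number of features. The helper name `all_subsets` is hypothetical, and the enumeration is only practical for small n.

```python
from itertools import product

def all_subsets(n_features):
    """Yield every feature subset as a binary mask of length n_features."""
    return product([0, 1], repeat=n_features)

# Exhaustive enumeration is feasible for 10 features...
count_10 = sum(1 for _ in all_subsets(10))

# ...but the number of subsets doubles with every additional feature.
sizes = {n: 2 ** n for n in (10, 20, 30, 40)}
```

Here `count_10` is 1024, while 40 features already give more than 10^12 candidate subsets, which is why exhaustive search is replaced by meta-heuristic search.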
Concerning subset evaluation, there are different methods to assess the goodness of a feature subset, such as filters and wrappers. A stopping criterion is a condition that halts the FS process and prevents an infinite loop. Examples include the completion of the search (all feature subsets have been examined), the learning performance reaching its highest limit, a subset of features with a specified size being obtained, a pre-defined number of iterations being reached, and convergence, in which the results become stable and no further enhancement is achieved. A direct way to validate the obtained results is to rely on prior knowledge from the domain. Unfortunately, this knowledge about features is usually unavailable, so other methods have to be used instead. FS can be validated by comparing the system performance using the whole set of features with its performance using the selected features. FS has many advantages that positively affect the data mining task, including improving the quality of the generated model, reducing the learning time of the classifier, making the data set easier to interpret, and reducing the need for hardware resources.

NIAs for Feature Selection
Two important points should be considered: the representation of a solution and its evaluation. Normally, a feature subset is represented by a binary vector. The dimensionality of the problem is equal to the number of features in the data set. If a gene value is set to 1, the corresponding feature is selected; otherwise, it is not selected. The quality of a feature subset is evaluated based on two contradictory objectives simultaneously: the classification accuracy (minimum error rate) and the minimal number of selected features. These two criteria are combined in one fitness function, shown in Equation (5):

$$\text{Fitness} = \alpha\,\gamma_R(D) + \beta\,\frac{|R|}{|C|} \qquad (5)$$

where $\gamma_R(D)$ is the error rate of the classification produced by a classifier, |R| is the number of selected features in the reduced data set, |C| is the number of features in the original data set, and α ∈ [0, 1] and β = (1 − α) are two parameters that represent the significance of the classification quality and of the length of the feature subset, set according to recommendations.
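A weighted fitness of this kind, combining the error rate with the ratio of selected features, can be sketched as follows. The value α = 0.99, the treatment of an empty subset as infeasible, and the stand-in error function `toy_error_rate` (which pretends that only features 0 and 2 are informative) are assumptions for illustration; in a real wrapper the error rate would come from training and testing a classifier on the selected columns.

```python
def fs_fitness(mask, error_rate, alpha=0.99):
    """Weighted sum of classification error and selected-feature ratio (lower is better)."""
    n_selected = sum(mask)
    if n_selected == 0:
        return float("inf")          # assumption: an empty subset is infeasible
    beta = 1.0 - alpha
    return alpha * error_rate(mask) + beta * n_selected / len(mask)

def toy_error_rate(mask):
    # Stand-in for a classifier (assumption): features 0 and 2 each
    # reduce the error by 0.2, the other features contribute nothing.
    return 0.5 - 0.2 * mask[0] - 0.2 * mask[2]

compact = fs_fitness([1, 0, 1, 0], toy_error_rate)   # informative features only
full = fs_fitness([1, 1, 1, 1], toy_error_rate)      # all features
```

Both masks reach the same error rate in this toy setting, so the smaller subset obtains the lower (better) fitness through the second term.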

NIAs FS Modifications
This section highlights the main modification techniques applied in the literature to enhance NIAs as FS wrappers. Based on 156 articles in the domain of modified NIAs-FS, the modification techniques can be classified into nine categories, as depicted in Figure 3: new operators, hybridization, update mechanism, modified population structure, different encoding scheme, new initialization, new fitness function, multi-objective, and parallelism.

New Operators
This modification depends on integrating a new operator into the original NIA structure to achieve certain targets, such as improving the algorithm performance, increasing the diversity of the population, enhancing the exploitation and exploration processes, facilitating the sharing of information between individuals in the population, repositioning the worst individuals in the population, and performing a search along various vectors in the search space [36]. In the literature, several operators have been used to enhance NIA wrappers. Some of these operators are discussed next.

Chaotic Maps
Chaos denotes a state of disorder. In mathematics, a chaotic map is a formula that describes a dynamic system with time dependence. A chaotic system is highly sensitive to its initial conditions: even a small modification in the initial conditions leads to big changes in the outcomes. Although a chaotic system is deterministic and does not incorporate any randomness, its results are not always predictable [50].
Chuang in [51] integrated two kinds of chaotic maps, namely logistic maps and tent maps, with Binary Particle Swarm Optimization (BPSO). Equation (6) gives the general formula of the logistic map, where $X_n$ is a number between 0 and 1 that represents the ratio of the current population size to the maximum population size and µ is a constant between 0 and 4. Equation (7) describes how Chuang used Equation (6) to modify the inertia weight, where w is the inertia weight in (0, 1) and t is the iteration number. The tent map was applied in the same way: Equation (8) is its general mathematical formula and Equation (9) is the modified version of the inertia weight using the tent map. Large inertia weight values facilitate more exploration, while small values facilitate more exploitation. Hence, chaos theory can be used to balance the two types of search in the search space. The study also reported that Chaotic Binary Particle Swarm Optimization (CBPSO) with a tent map achieved a higher classification accuracy than CBPSO with a logistic map. In the same year, Chuang presented another model for FS [52], a filter-wrapper approach based on a correlation-based filter (CFS) and Taguchi chaotic BPSO (TCBPSO). In [53], chaos was applied with BPSO for FS in text clustering. Ahmad in [54] used chaotic maps to modify the SSA algorithm by replacing the random parameter C3 with chaotic sequences generated by the logistic, piecewise, and tent maps; the chaotic maps clearly improved the SSA. In the same year, the influence of chaotic operators on SSA was investigated in [55]. The experiments showed that the logistic map achieved a better performance for the SSA algorithm over nine chaotic maps. The chaotic multiverse optimization (MVO) FS model was proposed in [56] to cope with some limitations of MVO.
Tent, logistic, Singer, sinusoidal, and piecewise chaotic maps were used. The results showed that the logistic chaotic map was the best, increasing the MVO performance more than the other maps. Sayed in [57] developed a new wrapper FS approach based on the Whale Optimization Algorithm (WOA) and chaos theory, named CWOA. He used 10 chaotic maps, namely Chebyshev, circle, Gauss/mouse, iterative, logistic, piecewise, sine, Singer, sinusoidal, and tent. The results showed that the circle chaotic map was the best among them. In [3], a model based on chaotic Moth Flame Optimization (CMFO) and Kernel Extreme Learning Machine (KELM) was proposed. In [58], Sayed developed a new FS system composed of the Crow Search Algorithm (CSA) and chaos theory to enhance the performance and convergence speed of CSA. Lately, in [59], the Binary Black Hole optimization Algorithm (BBHA) was modified by embedding chaotic maps into the movement of stars in the BBHA. This model was called CBBA and uses 10 chaotic maps. The results on three chemical data sets demonstrated that CBBA outperformed the BBHA in terms of the number of selected features, classification performance, and computational time.
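The behavior of the two maps most often mentioned above can be sketched with their textbook forms, $x_{n+1} = \mu x_n (1 - x_n)$ for the logistic map and $x_{n+1} = \mu \min(x_n, 1 - x_n)$ for the tent map. Using each generated value directly as an inertia weight is an illustrative assumption and may differ from the exact formulas used in the cited studies; the function names and the choice µ = 1.99 for the tent map are likewise assumptions.

```python
def logistic_map(x, mu=4.0):
    """Textbook logistic map; stays in [0, 1] for x in [0, 1] and mu in [0, 4]."""
    return mu * x * (1.0 - x)

def tent_map(x, mu=1.99):
    """Textbook tent map. mu is kept slightly below 2 because mu = 2
    collapses to 0 after a few dozen steps in binary floating point."""
    return mu * min(x, 1.0 - x)

def chaotic_sequence(step, x0, length):
    """Iterate a one-dimensional map starting from x0."""
    values, x = [], x0
    for _ in range(length):
        x = step(x)
        values.append(x)
    return values

# Candidate inertia weights, one per iteration (assumed usage).
logistic_weights = chaotic_sequence(logistic_map, 0.3, 100)
tent_weights = chaotic_sequence(tent_map, 0.3, 100)

# Sensitivity to initial conditions: a tiny change in the start value
# produces a very different sequence after a few dozen iterations.
perturbed = chaotic_sequence(logistic_map, 0.3000001, 100)
max_gap = max(abs(a - b) for a, b in zip(logistic_weights, perturbed))
```

The sequences are fully deterministic, yet `max_gap` becomes large even though the two start values differ only in the seventh decimal place.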

Rough Set
Rough Set (RS) theory was first described by Zdzislaw Pawlak at the beginning of the 1980s [60]. It is a mathematical concept related to topological operations. In mathematics, RS is a theory that tries to find two approximating sets for the original conventional set (crisp set). The first gives the lower approximation of the crisp set, which comprises the elements that surely belong to the target subset. The second gives the upper approximation of the crisp set, which comprises the elements that possibly belong to the target subset. The pair of approximating sets are themselves either crisp sets or fuzzy sets. Rather than elements either belonging or not belonging, as in crisp sets, fuzzy sets rely on a membership function for a gradual assessment of the elements. Unlike fuzzy sets, RSs depend on finding the positive region, not a membership function, for dealing with uncertainty and vagueness. RS has many advantages, including the approximation of concepts, the reduction of spaces, the discovery of equivalence relations, and the finding of minimal sets of data in vague and uncertain domains. In FS, RS tries to define the attribute dependency. Zainal in [61] proposed the RS-PSO model for a better representation of data. Another RS-PSO-FS model was proposed in [62] based on PSO-based Relative Reduct (PSO-RR) and PSO-based Quick Reduct (PSO-QR). Both tools depend on the dependency measure for comparing sets of attributes. In [63], the authors proposed a model for FS in nominal data sets based on BCS and rough sets. Another CS model was introduced in [64] by incorporating RS with different classifiers. In [65], a new model was developed based on two incremental techniques (QuickReduct and CEBARKCC). These are two filtering methods: the former is a rough set-based filter that simulates the forward generation method and the latter is a conditional entropy-based method.
These two methods were integrated with the Ant Lion Optimization (ALO) algorithm to improve the initial population quality. The RS-FA model was developed in [66]. Hassanien in [67] developed a new system based on rough sets and MFO. Lately, in [68], a hybrid model called BPSOFPA, composed of the Flower Pollination Algorithm (FPA) and PSO, was also developed. BPSOFPA was integrated with the RS approach for the FS problem. Ropiak in [69] integrated RSs with deep learning as rough mereological granular computing.
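The lower and upper approximations and the dependency measure described above can be sketched on a toy decision table as follows. This is a minimal illustration under stated assumptions: the data, the attribute names `a`, `b`, `d`, and the function names are hypothetical, and the dependency degree is computed as the size of the positive region divided by the number of objects.

```python
from collections import defaultdict


def approximations(objects, condition_attrs, target):
    """Return (lower, upper) approximations of the set `target`.

    objects: dict mapping object name -> dict of attribute values.
    Objects with identical values on condition_attrs are indiscernible
    and form one equivalence class.
    """
    classes = defaultdict(set)
    for name, attrs in objects.items():
        key = tuple(attrs[a] for a in condition_attrs)
        classes[key].add(name)
    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # whole class surely inside the target
            lower |= block
        if block & target:    # class possibly inside the target
            upper |= block
    return lower, upper


def dependency_degree(objects, condition_attrs, decision_attr):
    """Fraction of objects in the positive region of the decision."""
    decision_classes = defaultdict(set)
    for name, attrs in objects.items():
        decision_classes[attrs[decision_attr]].add(name)
    positive = set()
    for target in decision_classes.values():
        lower, _ = approximations(objects, condition_attrs, target)
        positive |= lower
    return len(positive) / len(objects)


# Hypothetical toy decision table (not from the surveyed papers).
table = {
    "o1": {"a": 0, "b": 0, "d": "no"},
    "o2": {"a": 0, "b": 0, "d": "yes"},
    "o3": {"a": 1, "b": 0, "d": "yes"},
    "o4": {"a": 1, "b": 1, "d": "no"},
}
lower_yes, upper_yes = approximations(table, ["a", "b"], {"o2", "o3"})
gamma_ab = dependency_degree(table, ["a", "b"], "d")
gamma_a = dependency_degree(table, ["a"], "d")
```

In this toy table the subset {a, b} has a higher dependency degree than {a} alone, which is the kind of comparison a reduct-based FS method relies on.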

Selection Operators
Inspired by Darwin's theory [70], which explained the evolution and changes in species through the natural selection mechanism, the genetic algorithm incorporates selection operators to select some individuals from the population for later breeding. A conventional strategy is to select directly on the fitness values of the solutions. In other methods, these fitness values are normalized by dividing the fitness of each individual by the sum of all fitness values. Another method sorts all individuals in the population according to their fitness values in descending order. The selection mechanism was applied in other studies by computing the accumulated fitness for each individual so that the accumulated fitness of the final individual is one [71]. All such methods become computationally expensive and may negatively impact the performance of GA when the population becomes larger. Other selection methods that are widely implemented with GA are Tournament Selection (TS) and Roulette Wheel Selection (RWS). The stochastic nature of these methods makes them simpler to implement and better in performance than the aforementioned methods. TS is the most applied selection operator with GA because of its simplicity. It randomly selects a set of solutions from the population, then the best one is used for breeding the next generation. In RWS, the mechanism differs in that no agent in the population is discarded. The RWS strategy builds a roulette wheel on which the fitness scores of all individuals are represented as sectors. An individual with a large fitness value will reserve a large sector on the wheel, which gives it a larger probability of selection. Individuals with small fitness scores will reserve small sectors on the wheel. In RWS, the final selection is made by spinning the wheel; the selected individual is the one at which the pointer rests when the wheel stops.
Mafarja in [46] developed a new model that combines TS with the WOA optimizer to enhance the exploration of the search. One year later, Mafarja presented in [72] an FS model based on the Grasshopper Optimisation Algorithm (GOA) with RWS and TS. In the same year, Mafarja developed a new wrapper FS model based on WOA and studied the effect of TS and RWS [73]. In [26], the selection operators were incorporated to improve the ABC optimizer. In [74], the method comprised a DE optimizer and an RWS structure for the selection of the Wavelet Packet Transform.
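The two selection operators described above can be sketched as follows. This is a minimal illustration, assuming a maximization problem with non-negative fitness values; the function names and the tournament size parameter `k` are hypothetical.

```python
import random


def tournament_selection(fitness, k, rng):
    """Pick k distinct random candidates and return the index of the fittest."""
    candidates = rng.sample(range(len(fitness)), k)
    return max(candidates, key=lambda i: fitness[i])


def roulette_wheel_selection(fitness, rng):
    """Return an index with probability proportional to its fitness.

    Assumption: all fitness values are non-negative and at least one
    is positive, so each individual owns a sector of the wheel.
    """
    total = sum(fitness)
    spin = rng.random() * total      # where the pointer stops
    cumulative = 0.0
    for i, f in enumerate(fitness):
        cumulative += f              # right edge of individual i's sector
        if spin < cumulative:
            return i
    return len(fitness) - 1          # guard against floating-point round-off


rng = random.Random(42)
scores = [0.1, 0.2, 0.3, 0.4]
parent_ts = tournament_selection(scores, k=2, rng=rng)
parent_rws = roulette_wheel_selection(scores, rng=rng)
```

With TS, selection pressure grows with `k`; with RWS, every individual with positive fitness keeps a nonzero chance of being selected.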

Sigmoidal Function
A sigmoid function is a mathematical function with an S-shaped curve and is considered a special case of a more general function called the logistic function, which is defined by Equation (10), where e is the natural logarithm base (Euler's number), $x_0$ is the sigmoid's midpoint, L is the sigmoid's maximum value, and k is the logistic growth rate of the curve [75]. The sigmoid function is defined by Equation (11). The sigmoid function has some special characteristics, including monotonic behavior: the function is defined on all real numbers and its output increases either from 0 to 1 or from −1 to 1. Moreover, the sigmoid function is differentiable and has a bell-shaped first derivative that is non-negative at each point. There are several variations of the sigmoid function, such as the hyperbolic tangent, the arctangent function, and algebraic functions, which are respectively defined by Equations (12)-(14). The sigmoid function is widely applied as the activation function of a Neural Network (NN). Another useful application of the sigmoid function is as a discretization method that converts a continuous space into a binary one, as required in feature selection. For solving the FS problem, Aneesh developed a modified BPSO called Accelerated BPSO (ABPSO). The strategy for accelerating the particles was to use a new velocity update function based on a sigmoidal function [76]. In [6], the sigmoidal function was used with BGWO in solving FS. In [77], different transfer functions that map continuous solutions to binary ones were applied in combination with the CS algorithm. The CS-sigmoid and CS-hyperbolic tangent variants were evaluated on five data sets. In [78], the effect of different transfer functions on the Bat Algorithm (BA) was studied. Sigmoid and hyperbolic tangent functions were used to analyze their influence on FS. 
The results showed that the sigmoid function was better than the hyperbolic tangent function in feature reduction for almost all data sets. Mafarja, in [79], presented new versions of the Grasshopper Optimization Algorithm (GOA) based on sigmoid and V-shaped TFs in the context of FS.
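For reference, the standard forms that match the symbols defined above can be written as follows. This is a reconstruction under the usual textbook definitions, with the numbering assumed to follow Equations (10)-(14); in particular, the algebraic variant shown is one common choice and may differ from the exact expression intended in Equation (14).

```latex
\begin{align}
f(x) &= \frac{L}{1 + e^{-k(x - x_0)}} \tag{10}\\
S(x) &= \frac{1}{1 + e^{-x}} \tag{11}\\
\tanh(x) &= \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{12}\\
f(x) &= \arctan(x) \tag{13}\\
f(x) &= \frac{x}{\sqrt{1 + x^{2}}} \tag{14}
\end{align}
```

Setting L = 1, k = 1, and $x_0 = 0$ in the logistic function yields the sigmoid, which ranges from 0 to 1; the hyperbolic tangent and the algebraic variant range from −1 to 1.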

Transfer Functions
Transfer functions (TFs) are mathematical formulas that play a significant role in mapping a continuous search space to a discrete search space. The discrete search space can be viewed as a hyper-cube in which solutions move in different directions within its boundaries by flipping their bit values. TFs are one of the most efficient ways to convert continuous meta-heuristic algorithms into their corresponding binary versions [80]. The mathematical formulations of these TFs can be found in [80]. The update procedure in a binary meta-heuristic algorithm switches solution elements between 0 and 1 based on a mapping formula (the TF) that links the original continuous update procedure with a new binary update procedure. More precisely, TFs define the probability of updating each element (gene/feature) in a solution to be either selected (1) or not selected (0).
Equations (15) and (16) define the general update formulas of a solution using S-TFs and V-TFs, respectively, where $X_i^d(t+1)$ represents the ith element (gene/feature value) in the X solution (feature subset) at dimension d (feature number/index) in iteration t + 1, and rand ∈ [0, 1] is a randomly generated number. These formulas can be reformulated to preserve the search concepts of any specific meta-heuristic algorithm. As an example, PSO was converted by Kennedy and Eberhart [81] from a real-valued algorithm to a binary one. The PSO binary conversion starts by employing a sigmoid function to convert the velocity values into probability values bounded in the interval [0, 1], as in Equation (17), where $v_i^d(t)$ indicates the velocity of particle i at dimension d in iteration t and $T(v_i^d(t))$ is the resulting probability. In the next step, the computed probabilities are used to update the position vector using Equation (18). To preserve the PSO continuous searching method and keep the concepts of pbest/gbest, the TF gives a high probability of switching gene values to genes with high velocity values, since they are far away from the best solution. A small probability is given to genes with small velocity values, since they are considered close to the best solution [80]. In the literature, several studies adopted TF operators for the FS problem. Mirjalili in [80] improved the performance of BPSO by using S-shaped and V-shaped TFs. The V-TFs improved the performance of BPSO more than the S-TFs. In [82], a new wrapper was developed by modifying the Salp Swarm Algorithm (SSA) using TFs. The proposed approach achieved significant superiority over other competitive approaches in 90% of the data sets. Mafarja in [83] presented a new wrapper FS method based on a modified Dragonfly Algorithm (DA) using time-varying S-shaped and V-shaped TFs. 
Recently, in the context of Internet of Things (IoT) attack, a new wrapper-based approach using the WOA was developed. The augmented WOA used both V-shaped and S-shaped transfer functions.
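The difference between the two TF families can be sketched as follows. This is a minimal illustration that assumes the commonly used forms: an S-shaped TF (the sigmoid) sets a bit to 1 with probability T(v), whereas a V-shaped TF (here |tanh(v)|) flips the current bit with probability T(v). The function names are hypothetical, and the exact TFs used in the cited studies may differ.

```python
import math
import random


def s_shaped(v):
    """Sigmoid TF: maps a velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def v_shaped(v):
    """One common V-shaped TF (assumed here): |tanh(v)|."""
    return abs(math.tanh(v))


def update_with_s_tf(velocity, rng):
    """S-TF rule: the new bit is 1 with probability T(v), else 0."""
    return [1 if rng.random() < s_shaped(v) else 0 for v in velocity]


def update_with_v_tf(position, velocity, rng):
    """V-TF rule: flip the current bit with probability T(v), else keep it."""
    return [1 - x if rng.random() < v_shaped(v) else x
            for x, v in zip(position, velocity)]


rng = random.Random(3)
new_s = update_with_s_tf([2.0, -2.0, 0.0], rng)
new_v = update_with_v_tf([1, 0, 1], [2.0, -2.0, 0.0], rng)
```

Under the V-shaped rule a zero velocity leaves the bit unchanged, while under the S-shaped rule a zero velocity gives an even chance of 0 or 1, regardless of the current bit.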

Crossover
In living things, chromosomal crossover is a recombination process that occurs between non-sister chromatids to exchange genetic material during sexual reproduction. This process ends in the production of new recombinant chromosomes. Setting the biology aside, genetic algorithms and evolutionary computation borrowed this process to exchange information between solutions in the population and generate new offspring in the next generation. In the genetic algorithm, recombination (crossover) is defined as a stochastic operator that enforces diversity in the population by exchanging (swapping) the bits after a random cutting point (crossover point) between the parents' vectors (selected individuals) to produce new children (offspring). Equation (19) shows how a crossover operator is used to combine solutions, where $\bowtie$ is an operator that performs the crossover scheme on the two binary solutions $X_i$ and $X_{i-1}$. In a binary space, the crossover can be realized by exchanging the binary bits of two solutions to obtain an intermediate solution. Equation (20) shows that the crossover mechanism switches between the two input vectors with the same probability, where $X^d$ is the value of the dth dimension in the vector yielded by applying the crossover operator to $X_i$ and $X_{i-1}$. In [84], a crossover operator was applied in combination with the sigmoid function to modify a Binary Grey Wolf Optimizer (BGWO). The BGWO1 approach was used to convert the Continuous version of GWO (CGWO) into the binary version. The first steps toward the three best solutions are converted into binary, then a random crossover is applied among them to find the updated position. The approach improved the performance of GWO. In [82], the crossover operator was applied to improve the Salp Swarm Algorithm (SSA) optimizer in solving the FS problem. 
The role of the crossover was to increase the diversity of the model and improve the exploration of the search space. In [73], the study incorporated many modification strategies into WOA. The priority was to overcome the limitations of WOA, namely entrapment in local minima and slow convergence, and the crossover was used to achieve this target. Mafarja in [79] applied multiple operators with GOA. The combination operator together with mutation was applied in his BGOA-M approach to achieve more exploration.
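The two crossover schemes described above, swapping bits after a random cutting point and choosing each bit from either parent with equal probability, can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes both parents are binary lists of the same length (at least 2 for the one-point variant).

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two parents after a random crossover point."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    point = rng.randint(1, len(parent_a) - 1)  # cut strictly inside the vector
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def uniform_crossover(parent_a, parent_b, rng):
    """Take each bit from either parent with the same probability (0.5)."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    return [a if rng.random() < 0.5 else b
            for a, b in zip(parent_a, parent_b)]


rng = random.Random(7)
p1 = [1, 1, 1, 1, 0, 0]
p2 = [0, 0, 1, 0, 1, 1]
c1, c2 = one_point_crossover(p1, p2, rng)
child = uniform_crossover(p1, p2, rng)
```

In both schemes, positions where the parents agree are inherited unchanged, so crossover alone cannot introduce a bit value that neither parent carries at that position.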

Mutation
In an organism, a mutation is an error that occurs during DNA replication (meiosis). The error specifically results from a permanent deletion, insertion, or alteration in a DNA segment (nucleotide sequence of the genome). Even though this is a small genome error, it causes abnormal changes in the characteristics of an organism. Evolutionary and genetic algorithms borrowed the same idea to make changes and increase diversity in the population. The advantage of mutation comes from preventing solutions from becoming too similar, thus ensuring that the evolution does not stop. Mutation operators alter one or more gene values (bits in the chromosome vector), which changes the solution from its previous state. Besides diversity, mutation can help mitigate the local minima problem. Equation (21) identifies the mutation process, where $X_i^d(t+1)$ is the ith element at the dth dimension in the $X_i$ solution. In [85], mutation was applied to each PSO solution after it was updated. A mutation probability, commonly 1/n, means that on average one bit of the solution is mutated (flipped). The results demonstrated the effectiveness of the modified PSO-FS model. In [53], the authors developed a hybrid intelligent algorithm that combined mutation with the BPSO and other operators to solve FS in text clustering. The model attained a higher clustering accuracy and improved the convergence speed of BPSO. In [79], the mutation operator was applied with the GOA optimizer. The BGOA-M approach outperformed the other approaches it was compared with. In [86], an Improved Harris Hawks Optimization (IHHO) was proposed based on elite opposite-based learning, mutation neighborhood search, and rollback strategies to increase the search performance.
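A bit-flip mutation with the commonly used rate 1/n can be sketched as follows. This is a minimal illustration under the assumption that n is the length of the binary solution and that each bit is flipped independently; the function name and the optional `rate` parameter are hypothetical.

```python
import random


def bit_flip_mutation(solution, rng, rate=None):
    """Flip each bit independently with probability `rate`.

    Assumption: when no rate is given, 1/n is used, so that one bit is
    flipped on average per solution.
    """
    n = len(solution)
    if n == 0:
        return []
    if rate is None:
        rate = 1.0 / n
    return [1 - bit if rng.random() < rate else bit for bit in solution]


rng = random.Random(11)
original = [1, 0, 1, 1, 0, 0, 1, 0]
mutated = bit_flip_mutation(original, rng)
```

Because the flips are independent, a given call may flip no bits or several bits; the 1/n rate only fixes the expected number of flips at one.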

Levy Flight
Levy flight has its roots in chaos theory. It describes a random walk whose step lengths follow a heavy-tailed probability distribution, and the steps can take place either on a discrete grid or in a continuous space. In mathematics, according to a central limit theorem, the steps from the original point of a random walk follow a stable distribution that can be modeled using the equations of Levy flights. Investigators found that Levy flights can describe animal hunting patterns, especially when the prey is sparsely distributed and not easily detected, as opposed to Brownian motion, which can only approximate the hunting pattern when prey is nearby, abundant, and predictable [87]. In [64], a novel Cuckoo Search (CS) algorithm was developed using Levy flight with rough sets. The idea was applied by integrating the Levy flight probability distribution into the equation that generates new solutions, as shown in Equation (22), where ⊕ denotes entry-wise multiplication, α is the step size, α > 0, and Levy(λ) is the Levy distribution described in Equation (23). In [88], Levy flight was used in combination with transfer functions to enhance the performance of the MFO algorithm and increase diversity.
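One way to generate heavy-tailed steps in practice is Mantegna's algorithm, which is a swapped-in technique here and is not stated in the text above. The sketch below uses it to move a continuous solution by a scaled Levy step in each dimension. The function names and the parameter `beta` (stability index, with 1.5 as a common choice) are assumptions for illustration.

```python
import math
import random


def mantegna_sigma(beta):
    """Standard deviation of the numerator variable in Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (num / den) ** (1 / beta)


def levy_step(beta, rng):
    """Draw one heavy-tailed step; beta is assumed to lie in (0, 2)."""
    if not 0.0 < beta < 2.0:
        raise ValueError("beta must lie in (0, 2)")
    u = rng.gauss(0.0, mantegna_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def levy_move(position, alpha, beta, rng):
    """New solution = old solution + alpha * Levy step, per dimension."""
    return [x + alpha * levy_step(beta, rng) for x in position]


rng = random.Random(5)
new_position = levy_move([0.2, 0.5, 0.9], alpha=0.01, beta=1.5, rng=rng)
```

Most steps are small, but occasional very long jumps occur, which is what lets a Levy-based update leave a local region of the search space.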

Other Operators
A local search operator was incorporated with GA to mitigate the weakness of standard GA in fine-tuning near local minima [89]. In [90], local search was used to improve the BPSO. A new local search and gbest resetting strategy called PSO-LSRG was proposed in [24] to facilitate exploitation. A Uniform Combination (UC) operator was used in [80] to improve the performance of BPSO. Later, UC was adopted in [91] to balance the exploitation and exploration of bare-bones PSO. The DE evolutionary operators were used in [5] to address the local optima problem in standard WOA. These operators include mutation, crossover, and selection. Boolean algebra (the AND operator) was used in BPSO [92]. The bacterial evolutionary algorithm and the PSO algorithm, each in a plain variant and a memetic variant complemented with gradient-based local search, together with fuzzy logic numbers, were used in [93] for solving various resource allocation problems.
A catfish strategy was applied in [94] to improve the performance of BPSO by introducing new particles into the search space when there is no improvement in the searching process, for example, when the gbest is unchanged over a number of consecutive iterations. The catfish particles replace the particles with the worst fitness and initialize a new search from the extreme positions of the search space. Feature subset ranking was introduced in [95]. The idea was to compute the significance of each feature according to its classification accuracy and compute the accuracy for some combinations of these ranked features; the BPSO wrapper approach was then used to search the top-ranked feature subsets instead of the whole feature set.
A Gaussian operator was introduced in [96], based on the idea that FS is highly influenced by feature interaction. Features that are highly relevant to a class label may have high interactions with other features, which makes them redundant. On the other hand, features that are irrelevant to a class label may have small interactions with other features. As feature interaction is a challenge to classification and FS, a statistical clustering method based on the Gaussian distribution was adopted. It groups homogeneous features based on the interactions between features, then the PSO algorithm selects one feature from each cluster. Thresholding was adopted in [97]. The idea was to set a nonzero value for a threshold based on the number of trials for which BPSO was run. The significance of a particular dimension is measured by how frequently that dimension appears in the gbest vector across all runs. The final gbest after thresholding will contain the most recurrent features.
Zhang in [91] used Gaussian sampling to compute the positions of particles based on pbest and gbest instead of velocity. Another operator, called reinforced memory, was also incorporated. Reinforced memory is based on the idea of enhancing the probability of survival for outstanding genes, that is, the important features with high fitness value in the current iteration. Consequently, the update of the local leader (pbest) of each particle avoids gene degradation and preserves these genes in the next iteration. Hamming distance was used in [98] to replace the Euclidean distance in BPSO. In particular, it was used to measure the distance between two binary vectors by applying the Exclusive-OR (XOR) operator and counting the number of ones in the resulting vector. In [99], a new model called Hybrid Particle Swarm Optimization Local Search (HPSO-LS) was proposed based on using local search with correlation information. The correlation information was used to guide the local search in PSO. This was carried out by including the most dissimilar (low-correlated) features as a feature subset in the newly generated particles. Consequently, similar (highly correlated) features have less chance of being selected. Moreover, HPSO-LS used a specific subset size determination scheme to allow PSO to search within a bounded region and find a smaller number of features.
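The XOR-based Hamming distance mentioned above can be sketched in a few lines. This is a minimal illustration with a hypothetical function name; it assumes both inputs are equal-length lists of 0/1 integers.

```python
def hamming_distance(a, b):
    """Count positions where two binary vectors differ.

    XOR yields 1 exactly where the bits differ, so summing the XOR
    results counts the number of ones in the resulting vector.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x ^ y for x, y in zip(a, b))


# Two hypothetical feature subsets over four features.
distance = hamming_distance([1, 0, 1, 1], [1, 1, 0, 1])
```

For feature subsets, this distance is simply the number of features that are selected in one subset but not in the other.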
Binary quantum was used in [100] to modify and improve the PSO. The idea was to perform a sampling around the personal best and compute the mean best (mbest) of the sampled points, then introduce this value in the BQPSO. Each bit position of the mbest is set to 1 if 1 appears more often than 0 in the corresponding bit positions of all pbests, and to 0 if 0 appears more often. If 1 and 0 have the same frequency, the mbest bit is set randomly to either 0 or 1. A re-initialization strategy was applied on PSO-mGA in [101]. The idea was to use a small population (3-6 chromosomes) with a re-initialization strategy to achieve convergence. A non-replaceable memory operator was added to keep the original swarm intact during the optimization process. This helps increase the diversity of the swarm. Moreover, the non-replaceable memory was used for maintaining a secondary swarm with a leader and followers. Zhang in [102] developed a new wrapper-based approach by utilizing the Firefly Algorithm (FA) with return-cost, Pareto dominance-based, and adaptive movement operators. A return-cost indicator was used to compute attractiveness. The firefly is cloned based on the return cost instead of the distance, so that a firefly with a large return and a small cost has a great chance to be cloned. A Pareto dominance-based operator was also added. Pareto dominance is commonly used in multi-objective optimization. Here it is a selection strategy used to find the attractive firefly for each firefly based on cost and return. An adaptive jump was used in place of the fixed uniform jump. It changes the jump probability according to a linear function of the number of iterations to allow for more exploration.
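The bit-wise majority rule for the mean best (mbest) described above can be sketched as follows. This is a minimal illustration with a hypothetical function name; it assumes the pbests are equal-length lists of 0/1 integers and breaks ties with a random bit.

```python
import random


def mean_best(pbests, rng):
    """Compute mbest by majority vote over each bit position of all pbests."""
    n_particles = len(pbests)
    mbest = []
    for bits in zip(*pbests):        # iterate over bit positions (columns)
        ones = sum(bits)
        zeros = n_particles - ones
        if ones > zeros:
            mbest.append(1)
        elif ones < zeros:
            mbest.append(0)
        else:
            mbest.append(rng.randint(0, 1))  # tie: choose 0 or 1 at random
    return mbest


rng = random.Random(2)
pbests = [[1, 0, 1], [1, 0, 0], [1, 1, 0]]
mbest = mean_best(pbests, rng)
```

With an odd number of particles no ties can occur, so the result is deterministic; with an even number, tied positions depend on the random generator.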
In [103], a greedy search was used to enhance the local search. Three modified versions of the Lion Algorithm (LA) (Lion M1, Lion M2, and Lion M1+M2) were proposed to improve the local search. Mafarja, in [72], applied a new methodology based on BGOA and the Evolutionary Population Dynamics (EPD) operator. EPD relies on local changes within the population rather than on an external force. This idea comes from the theory of Self-organized Criticality (SOC). Hancer, in [26], developed a new version of the DisABC algorithm for FS by introducing a DE-based neighborhood mechanism into the similarity-based search of DisABC. DE evolutionary operators were also used in [5] for solving the problem of local optima in native WOA. These include mutation, crossover, and selection operators. Khushaba in [74] developed a new modified FS method called DEFS using a repair mechanism. The repair mechanism was based on feature distribution measures and the RWS structure. A new model was developed in [104] based on GA and m-features (OR operator). The OR operator performed a search space reduction and improved GA performance and convergence. Zeng in [105] developed a novel GA with a new population structure and a new operator called dynamic neighboring. Dynamic neighboring is a new selection strategy that was used to boost the capabilities of GA for the FS problem. In [106], Guo proposed a new repair operator that allowed GA to transform feature subsets from arbitrary combinations to valid combinations that conform to the feature model constraints and the domain-specific objective function.

Hybridization
Hybridization means the integration of more than one algorithm to build a powerful predictive framework that combines the power of the integrated algorithms. The expectation is that combining the complementary features of different optimization strategies achieves better performance than implementing them separately as pure paradigms. Several categories of NIA hybridization techniques were investigated in the literature, such as combining an NIA with another NIA or combining an NIA with algorithmic components from other areas of optimization, such as tree search, dynamic programming, and constraint programming [107].

NIA-NIA Hybridization
In memetic models, a single-solution algorithm is embedded in a population-based algorithm to enhance the local search and exploitation of the search space. These algorithms are implemented in two search stages. In the first stage, the algorithm captures a global view of the search space. In the second stage, the algorithm focuses on the most promising area to perform a successive process of local search. In this way, these models maintain the exploration/exploitation balance and avoid premature convergence. In [4], Zawbaa developed a novel hybrid GWO-ALO system that exploits the global search ability of GWO and the local search performance of the Ant Lion Optimization algorithm (ALO). In [65], Mafarja developed a hybrid model based on BALO and hill-climbing techniques called HBALO. A new hybrid algorithm was presented in [108] by combining the Clonal Selection Algorithm (CSA) with the Flower Pollination Algorithm (FPA). CSA was good in exploitation, while FPA was good in exploration via Levy flight. In [109], the Mine Blast Algorithm (MBA) was used to support the exploration phase. MBA was integrated with simulated annealing (SA) to optimize a local search in the exploitation phase to get closer to the optimal solutions. Ibrahim in [110] designed a hybrid SSA-PSO model. He integrated the update strategy of PSO into the structure of SSA so that the update for the current population was done by using either SSA or PSO depending on the quality of the fitness function. The PSO-mGA (micro Genetic Algorithm) model was presented in [101]. The ACO-DE model was developed in [23]. A novel SA-MFO model was presented by Sayed in [111]. SA was used to slow the convergence rate, reach the global optimum, and escape local minima. A new MFO-based hybrid model was developed in [112] by combining MFO and Levy FA (LFA) algorithms. 
Another target of NIA-NIA hybridization is to refine the best solutions by implementing the NIAs sequentially as a pipeline, where the operators of the first algorithm are applied first, then the operators of the other integrated algorithms are applied in sequence. These models often suffer from being slow in the search process. This hybridization strategy was applied in [113] to develop the PSO-GA model. In [46], the WOA-SA model was developed. In WOASA-1 (Low-Level Team-work Hybrid (LTH)), SA was used as an operator in WOA to enhance the exploitation. In WOASA-2 (High-Level Relay Hybrid (HRH)), SA was used after WOA to enhance the final solution. In 2020 [114], SA was hybridized with the HHO algorithm and with AND and OR bitwise operations. SA was used to help the HHO optimizer escape local minima in the feature search space. In [115], a new hybrid binary version of the Bat Algorithm (BA) was suggested to solve FS problems, in which BA was hybridized with an enhanced version of the DE algorithm to reach the global solution. Hybridizing different NIAs to perform parallel exploration of the search space was also a primary target for other studies. Each algorithm generates its initial population and iteratively explores and evaluates the feature subsets. Using this strategy increases the speed of the search process. ACO-GA is an example of these hybrid models [116,117]. Recently, in [118], an enhanced hybrid approach using GWO and WOA was proposed to alleviate the drawbacks of both algorithms.
Another target for NIA-NIA hybridization is to enhance the initialization of the search using different NIAs. In these models, one algorithm is used to generate the initial solutions. Then the other combined algorithm is used to update these solutions. An example of these models is GA-IGWO, presented in [119]. In [120], the hybridization of two Immune Firefly Algorithms (IFA1 and IFA2) was proposed. In IFA1, the FFA and Artificial Immune System (AIS) are used simultaneously to increase the global search of fireflies and select the best feature subset. IFA2 was used to study the influence of the initial population on the searching progress of the AIS algorithm.

NIA-Classifier Hybridization
In this type of hybridization, different classifiers such as SVM, Artificial Neural Network (ANN), aided Radial Basis Function (RBF), Optimum Path Forest (OPF), bagging, and Bayesian statistical classifiers are combined with an NIA to evaluate the solutions. Since classifiers have different capabilities regarding training speed, computational complexity, and generalization capability, many studies investigated their influence when used in the wrapper framework. Other studies tried to perform simultaneous FS and parameter optimization to enhance the performance of a classifier. The NIA in these hybrid models works as a tuner that optimizes the training parameter setup and selects the optimal feature subset. In [121], a new wrapper approach was built to perform parallel FS and optimization of SVM parameters by exploiting the merits of MVO. Another hybrid model was presented in [122] for optimizing the SVM parameters simultaneously with selecting the best feature subsets using a GOA optimizer.

NIA-Filter (Wrapper-Filter) Hybridization
The filter-wrapper hybrid model is applied in two ways. First, a filter is applied to eliminate redundant and irrelevant features, minimize the dimensionality, and produce a reduced data set that is ready to be used by a wrapper. The second way to apply the filter-wrapper model is to use the filter in the structure of a wrapper to evaluate the generated feature subsets. In [123], information gain and correlation-based filters were integrated with BPSO in models called IG-IBPSO and CB-IBPSO, respectively, to solve FS. In [17], an MSPSO-F-score model was developed. A mutual information filter was integrated with PSO and presented as a model called MI-PSO in [124]. PSO-MI and PSO-Entropy were developed in [125]. CS-MI was developed in [126]. BALO with QuickReduct and CEBARKCC filtering approaches was developed in [65]. In [5], IWOA-IG was developed. The ACO-MI model was presented in [127]. ACO with the multivariate filter was presented in [16]. The GA-MI model was presented in [128], GA-IG in [18,129], and GA-entropy in [130]. In [131], Relief-f was used with DE to rank the most significant features. Lately, in [132], an Embedded Chaotic Whale Survival Algorithm (ECWSA) was proposed that combines a wrapper process with a filter method. In [133], an efficient hybrid model combining a filter and an evolutionary wrapper approach was proposed for sentiment analysis of various topics on Twitter. The classification system was based on an SVM classifier and two FS methods using the ReliefF and MVO algorithms. The authors in [134] proposed a filter-wrapper approach using a Sequential Floating Forward Search (SFFS) to acquire features for activity recognition. The model was validated using a benchmark dataset with a multiclass Support Vector Machine (SVM). The results show that the system is effective even with limited hardware resources.

Update Mechanism
The update modification aims to achieve a balance between exploration and exploitation. The update strategy is performed by either enhancing the update process of individuals or dynamically controlling the NIA parameters. A new variant of ACO was presented in [25,135]. The update strategy used performance and the number of selected features as heuristic information for ACO with no need for prior information about features. In [136], the gbest was updated based on some conditions. This strategy determines when to reset the gbest based on the number of epochs (iterations) in which the value of the gbest did not change. The same strategy was applied in [24,90]. Martinez, in [137], claimed that the initialization procedure and the update of all particles are not beneficial in high-dimensional space. Hence, only a small subset of particles is randomly selected to be updated. The update for a particle is carried out by filling it with active features from the current particle, local best, and global best. This strategy was applied to the original PSO to get a new variant called CuPSO.
In [138], a new rule to update particles' positions was proposed. The original rule in BPSO gives equal probabilities to selecting and not selecting a feature, that is, $P(x_i^d(t) = 0) = P(x_i^d(t) = 1) = 0.5$, where $x_i^d(t)$ is the gene in dimension d of the position vector at iteration t. The new rule was introduced to increase the probability of $x_i^d(t+1) = 0$ and reduce the probability of $x_i^d(t+1) = 1$. The idea in [139] starts from the observation that pbest is usually updated based on the fitness value alone. If the new position has the same fitness value as the current pbest, the pbest is not updated even if the new solution corresponds to a smaller feature subset. This is a limitation of PSO. The proposed PSO updates pbest and gbest in two stages, where priority is given first to the classification accuracy. Next, if the new particle position has the same performance as the current pbest but a smaller number of features, pbest is updated and replaced by the new position.
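The two-stage pbest update described above can be sketched as a simple comparison rule. This is a minimal illustration with hypothetical names; the tolerance `tol` used to decide that two accuracies are equal is an assumption added for floating-point safety and is not part of the description above.

```python
def should_replace_pbest(new_acc, new_n_features,
                         best_acc, best_n_features, tol=1e-12):
    """Decide whether a new position should replace the current pbest.

    Stage 1: a strictly better classification accuracy always wins.
    Stage 2: with equal accuracy, the smaller feature subset wins.
    """
    if new_acc > best_acc + tol:
        return True
    if abs(new_acc - best_acc) <= tol and new_n_features < best_n_features:
        return True
    return False


# Same accuracy, fewer features: the new position replaces pbest.
replace = should_replace_pbest(0.91, 12, 0.91, 15)
```

The same rule can be applied when comparing a particle's pbest against the gbest.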
In [96], the objective was to update PSO based on a clustering approach. The new GPSO uses the Gaussian distribution. The idea was to group homogeneous features based on interactions between features, then PSO is used to select one representative feature from each cluster. Mafarja in [140] proposed five update strategies for the inertia weight (w) parameter. Linear, non-linear coefficient, decreasing, oscillating, and logarithm strategies were applied. His idea was to apply exploration more than exploitation at the beginning of the search, then search the promising regions carefully to find the global optimum. The conclusion was that a gradual decrease of the inertia weight (w), either linear or non-linear, improves BPSO. Mafarja in [141] studied the influence of the inertia weight (w) parameter on the performance of BPSO. He suggested an adaptive change between exploration and exploitation by using a rank-based strategy for updating the inertia weight (w) parameter. The same author presented in [83] a time-varying update strategy to improve the performance of the DA optimizer. In [142], Aljarah applied several asynchronous update strategies to solve the FS problem. An adaptive update strategy based on a descending linear function was used to update the SSA c1 parameter.
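A linearly decreasing inertia weight, the simplest of the time-varying strategies mentioned above, can be sketched as follows. The function name and the bounds 0.9 and 0.4 are assumptions (commonly quoted defaults), not values reported in the cited studies.

```python
def linear_inertia_weight(t, max_iter, w_max=0.9, w_min=0.4):
    """Inertia weight that decreases linearly from w_max to w_min.

    A large w early in the run favors exploration; a small w late in
    the run favors exploitation around the best solutions found.
    """
    if max_iter <= 0:
        raise ValueError("max_iter must be positive")
    return w_max - (w_max - w_min) * t / max_iter


# Weight schedule over a hypothetical run of 100 iterations.
schedule = [linear_inertia_weight(t, 100) for t in range(101)]
```

Non-linear, oscillating, or logarithmic schedules can be obtained by replacing the linear term with another function of t.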
Recently in [143], a Binary DA (BDA) was proposed with new mechanisms to update its main coefficients. The main target is to apply the survival-of-the-fittest principle using different functions such as linear, quadratic, and sinusoidal. Three variants of BDA were introduced and compared with the standard DA. The new variants are linear-BDA, quadratic-BDA, and sinusoidal-BDA. Recently, in [144], a time-varying number of leaders and followers in a binary SSA (TVBSSA) with Random Weight Network (RWN) was proposed. In 2020, the CSA algorithm was enhanced in [145] using three enhancement strategies to solve the FS problem: Adaptive awareness probability to balance exploration and exploitation, dynamic local neighborhood to improve local search, and proposing a global search strategy to increase the global exploration of the crow.
In [146], an enhanced Binary Global Harmony Search algorithm, called IBGHS, was proposed to solve FS problems. An improved step is proposed to enhance the global search ability. In [147], a new update strategy based on ranking of the individuals was proposed. Each moth in the MFO algorithm is given a rank based on its fitness value. Therefore, a moth with a small fitness value will have a high rank so that there will be a great change in its position. On the other hand, a moth with a high fitness value will have a small rank so that there will be a small change in its position. This adaptive update strategy enhanced the performance of the optimizer. In [148], a time varying flame strategy was proposed to enhance the MFO algorithm. The number of flames represents the number of the best solution that decreases gradually across iterations. Different mathematical formulas were experimented with to decide the best formula that ensures exploitation around the best solution in the late stages.
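As an illustration of a time-varying flame strategy, the sketch below decreases the number of flames linearly from the population size to one, which is the schedule commonly associated with the standard MFO algorithm. It is shown only as one possible formula; the alternatives compared in [148] are not reproduced here, and the function name is hypothetical.

```python
def flame_count(iteration, max_iter, n_moths):
    """Number of flames (best solutions kept) at a given iteration.

    Decreases linearly from n_moths at iteration 0 to 1 at max_iter.
    """
    return round(n_moths - iteration * (n_moths - 1) / max_iter)
```

Late in the run only a few flames remain, so the moths concentrate on the neighborhood of the best solutions.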

Modified Population Structure
Zeng in [105] developed a novel GA with a dynamic chain-like agent population structure. CAGA aimed to enhance the population structure and diversity. This was better than the lattice-like agent population structure where agents do genetic operations just with neighboring agents. In [101], Mistry used a new population structure for PSO-mGA. He used a small-population secondary swarm strategy. A secondary swarm performs a collaborative role to avoid stagnation and overcome premature convergence.

Different Encoding Scheme
Galbally in [149], tried to minimize the verification error rate in the online signature system. Different encoding schemes were used, including binary and integer coding. GA with binary coding was used to search the complete search space. On the other hand, GA with integer coding was used for searching a subset of the search space. GA with an optimized descriptor weight or/and optimal descriptor subset was developed in [150] over MPEG-7. There were three different encoding schemes: A real-coded chromosome for weight optimization, binary-coded chromosome for the selection of optimal feature descriptor subset, and bi-coded chromosomes for simultaneous weight optimization and optimal feature descriptor selection. A new ensemble classifier was proposed in [151]. It was based on AdaBoost learning and parallel GA. A hybrid model parallel-GA-AdaBoost with different encoding schemes BGAFS and BCGAFS was proposed.

New Initialization
In [53], authors developed a hybrid model based on BPSO to solve the FS problem. A new initialization strategy called Opposition-based Learning (OBL) was proposed. The OBL strategy was used to enhance the initialization of particles and enforce diversity among solutions by considering the solution as well as its opposite solution simultaneously. OBL was used also to generate the opposite position of the gbest particle to get rid of the stagnation case. A novel framework based on IGWO and Kernel Extreme Learning Machine (KELM) was developed in [119]. In the GA-IGWO-KELM model, GA was applied first to generate high quality and diversified initial positions, then GWO was used to update the positions of the individuals in the discrete search space. Tubishat, in [5] developed a hybrid model called IWOA-SVM-IG. The OBL strategy was applied for increasing the level of diversity in the initial solutions generated by standard WOA. In [152], a quasi-oppositional learning-based Multi-Verse Optimization (MVO) algorithm was used to improve the initial setting up of solutions.
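A minimal sketch of opposition-based initialization for binary feature masks is given below. It assumes that the opposite of a binary solution is its bitwise complement and that the fitter half of the combined pool is kept; the function name, the `fitness` callable (higher is better), and the seeding are illustrative assumptions rather than details from [53] or [5].

```python
import random


def obl_initialize(n_particles, n_features, fitness, seed=0):
    """Build an initial population from random masks and their opposites."""
    rng = random.Random(seed)
    # Random binary masks: 1 means the feature is selected.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(n_particles)]
    # Opposite solution of a binary mask: flip every bit.
    opposites = [[1 - bit for bit in mask] for mask in population]
    # Evaluate both sets together and keep the best n_particles.
    candidates = population + opposites
    candidates.sort(key=fitness, reverse=True)
    return candidates[:n_particles]
```

For continuous positions bounded by `lb` and `ub`, the usual opposite point is `lb + ub - x`; the bit flip is its binary counterpart.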

New Fitness Function
Chakraborty [153] proposed a PSO algorithm where the fitness evaluation of each particle is based on ambiguity. The new fuzzy evaluation function was used to measure the fuzziness of a fuzzy set. The best feature was represented with minimum intraclass ambiguity as well as maximum interclass ambiguity. In [154], GA was proposed with Fisher's Linear Discriminant function in a model called GA-FLD. The new evaluation function estimates the probability distribution of the class in the N-dimensional feature space. It also uses the cardinality of the feature subset using covariance matrices, which is an extension of FLD. This method was used to measure the statistical properties of the feature subset. Authors in [53] developed BPSO with a new fitness function based on dynamic inertia weight. High inertia weights are assigned to particles with low fitness values to facilitate more exploration of the search space. Low inertia weights are assigned to particles with high fitness to facilitate more exploitation. In [6], GWO was modified using several fitness functions. The fitness functions were accuracy, Hausdorff distance, Jeffries-Matusita (JM) distance, the weighted sum of the accuracy and Hausdorff, and the weighted sum of the accuracy and JM. In [155], different fitness functions were used to enhance the performance of the MFO algorithm. The best fitness function was the one that was applied across two stages. The first stage optimizes the classification performance only, while the second stage takes into consideration the number of genes. The results show that the proposed fitness functions can achieve better classification results compared with the fitness function that takes into account only the classification performance.
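One way to read the two-stage fitness idea is sketched below: early iterations score a subset by accuracy alone, and later iterations add a reward for smaller subsets. The switch point `switch_iter`, the weight `alpha`, and the weighted-sum form are assumptions made for illustration and are not the formulation of [155].

```python
def two_stage_fitness(accuracy, n_selected, n_total,
                      iteration, switch_iter, alpha=0.9):
    """Fitness to maximize; hypothetical two-stage scheme."""
    # Stage 1: only classification performance matters.
    if iteration < switch_iter:
        return accuracy
    # Stage 2: also reward subsets that use fewer features.
    reduction = 1.0 - n_selected / n_total
    return alpha * accuracy + (1.0 - alpha) * reduction
```

In the second stage, two subsets with equal accuracy are separated by their size, so the search drifts toward compact gene sets without giving up accuracy first.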

Multi Objective
Zio [156], developed a system for nuclear plants based on GA to select among the several measured plant parameters. The first approach applied was single objective GA with fuzzy k-Nearest Neighbor classifier (KNN) then multi-objective approaches were applied. Mandal, in [157] developed a prediction system based on a multi-objective PSO that satisfies the Pareto front and makes a trade-off between the non-dominated solutions based on different objectives. The proposed multi-objective PSO FS algorithm performed a dual-task where the first objective was maximizing the mutual information between a feature and class label (relevance) and the second objective was minimizing the mutual information among the features (redundancy). A Dynamic Locality Multi-Objective SSA for FS was proposed in [158]. In [159], a multi-objective FS method was proposed based on bacterial foraging optimization. In [160], a multi-objective PSO modified by Levy Flight was proposed for intrusion detection in Internet of Things (IoT). RWS mechanism was used to remove redundant features and information exchange mechanisms to avoid local minima. A systematic review of the multi-objective FS problem that covered the related studies in the period (2012, 2019) was introduced in [161].
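The trade-off between relevance (maximized) and redundancy (minimized) leads to a set of non-dominated solutions. The sketch below filters such a set from candidate subsets that are already scored as `(relevance, redundancy)` pairs; the scoring itself, for example by mutual information, is assumed to be done elsewhere, and the function names are illustrative.

```python
def dominates(a, b):
    """True if a dominates b; points are (relevance, redundancy) pairs."""
    # a is no worse than b on both objectives...
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    # ...and strictly better on at least one of them.
    better = a[0] > b[0] or a[1] < b[1]
    return no_worse and better


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]
```

A multi-objective optimizer such as the PSO variant in [157] maintains a set of this kind during the search instead of a single best solution.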

Parallelism
In [162], Punch applied a wrapper FS based on GA to biological datasets. 5KNN was modified to work on weighted features (multiplied by weights according to their importance). The new approach was applied to a parallel distributed machine (Sparc and HP). A new ensemble classifier was proposed in [151]. It was based on AdaBoost learning and parallel GA. A parallel version of GA was applied on 16 processors with a master-slave paradigm and KNN was used as a base classifier. Ghamisi in [163] applied the parallelism strategy on PSO. Darwinian PSO (DPSO) was based on running many PSO algorithms simultaneously. Each algorithm runs as a different swarm on the same problem. A natural selection process was applied by rewarding the swarm that got better results and extending its particles' life so that new descendants were spawned. On the other hand, a swarm with suboptimal results (stagnation) was punished: its search area was discarded and its life was reduced by deleting its particles.

NIAs FS Applications
This section provides an extensive discussion on the use of modified NIA algorithms in different applications.

Microarray Gene Expression Classification
In [164], a hybrid model of GA and SVM was developed to perform FS and kernel parameter optimization. GA-SVM is a recommended approach for FS especially when the kernel parameters are optimized and the number of selected features is not known beforehand. Huang, in [128], developed a new GA-based wrapper approach. He adopted two stages of optimizations. The outer optimization stage (global search) applied a fitness function based on mutual information between actual classes and predicted classes. The second stage (the inner optimization) implements a local search (filter manner) based on feature ranking. A gene selection approach based on ACO was developed in [165]. A high-dimensional multi-class cancer gene expression (GCM) and colon cancer data sets were used. The comparisons were conducted with several rank-based models. The simulated results proved the validity of the proposed ACO approach for FS in high dimensional data sets.
A reliable FS technique was developed in [136] for selecting relevant features in gene expression data sets. The proposed methodology was IBPSO-KNN. The accuracy increased by 2.85% compared with other methods in the literature. Yang [92] presented a new modified model for BPSO and applied it to six multi-category cancer-related human gene expression data sets. Yang [18] developed a hybrid filter-wrapper method for FS in microarray data sets using GA and IG. The ranking of features was performed using a decision tree. Experiments showed that the IG-GA algorithm reduced the number of gene expression levels and either achieved higher accuracy or used fewer features compared to other methods. A hybrid filter-wrapper model based on Information Gain (IG), Correlation-based FS (CFS), and IBPSO was proposed in [123]. Kabir [166] developed a new hybrid model based on GA, NN, MI, and local search operators. A new PSO model that has the capability of discovering biomarkers from microarray data was designed in [137].
Chuang [52] developed a hybrid model for FS and classification of large-dimensional microarray data sets. Mohammad [138] developed a diagnostic medical model based on IBPSO to find the least possible number of discriminative genes. One year later, Kabir developed an ACO-based FS model in [167]. The ACOFS target was to select the salient features with the smallest size. The model combined ACO, a neural network, and a filter, and included an update for the rules based on a subset size determination scheme. In [24], the PSO variant was superior to other methods in terms of performance, number of features, and cost.
A new filter-based approach based on the CS optimizer, Mutual Information (MI) filter, entropy, and Artificial Neural Network (ANN) classifier was proposed in [126]. The entropy and mutual information were applied in the fitness function to calculate the relevance and redundancy of the feature subsets. Banka developed a new modified version of the PSO algorithm in [98]. Three benchmark data sets were used for colon cancer, diffuse B-cell lymphoma, and leukemia. The model achieved a minimal number of features and a higher classification accuracy. In [100], a model for cancer gene selection and cancer classification was developed based on BQPSO and SVM with LOOCV. Five DNA microarray data sets were used. Experiments showed better results for BQPSO/SVM compared with BPSO/SVM and GA/SVM in terms of accuracy, robustness, and the number of genes selected. Zawbaa [4] handled the complexity of the FS problem in data sets with large dimensionality and a small number of instances by developing a novel hybrid system called GWO-ALO. A total of 27 different microarray and image processing data sets were used. Some of the data sets were very complex, with 50,000 features and fewer than 200 instances. The experiments showed promising results when compared with GA and PSO. Ibrahim, in [168], developed a novel wrapper approach based on combining SVM with the GOA optimizer, and then applied the hybrid model to three biomedical data sets from Iraqi cancer patients and UCI.

Facial Expression Recognition
A new modified ACO-based FS approach without a need for prior knowledge about features was presented in [25]. The experiments were applied to an ORL gray-scale face image database. The same author proposed after one year another ACO-based FS approach [135], which showed superior performance compared with GA-based and other ACO FS approaches. Aneesh, in [76], proposed a new face recognition technology using a modified version of BPSO, called Accelerated BPSO (ABPSO). ORL database images taken at the AT&T Laboratories and Cropped Yale B database-4 were used in the experiments. A biometric technique for Face Recognition (FR) based on BPSO was developed in [97]. Seven benchmark databases, namely, Cambridge ORL, UMIST, Extended YaleB, CMUPIE, Color FERET, FEI, and HP were used in the experiments.
Zhang [112] developed a facial recognition system based on the MFO-LFA-SA hybrid model to avoid premature stagnation and to guide the search procedure towards global optima. The MFO logarithmic spiral search behavior increased the exploitation power, while the LFA used the attractiveness function for more exploration in the search space. The SA empowered the exploitation around the most promising solution. Experiments used frontal-view images extracted from CK+, JAFFE, MMI, and BU-3DFE. MFO-LFA FS outperformed other facial expression recognition models. Mistry [101] incorporated several update mechanisms in one model, including the hybridization of PSO and mGA (micro Genetic Algorithm), a modified population structure, a new velocity update strategy, a diversity maintenance strategy, and a subdimension-based regional facial feature search strategy. Cross-domain images from the extended Cohn Kanade and MMI benchmark databases were used in the experiments, besides multiple classifiers including NN with back-propagation, a multi-class SVM, and ensemble classifiers.
In [169], a system for Facial Emotion Recognition (FER) was developed based on GWO-NN. The hybridization was used to tune the weights with less training error, then it classified the emotions from the selected features. The proposed FER system was evaluated using the JAFFE and Cohn-Kanade database and the results showed higher accuracy compared with conventional methods.

Medical Applications
A new recognition system for skin tumor diagnosis was developed by Handels in [170]. A GA algorithm was used to extract the most suitable features from 2D images that characterize the structure of the skin surface. NN with back-propagation was used as a learning paradigm that was trained using the selected feature sets. Different network topologies and parameter settings were investigated for optimization purposes and GA was compared with heuristic greedy algorithms. The GA-based skin tumor system achieved the highest classification performance of 97.7%. An optimized mass detection system for digitized mammograms was developed by Zheng [171]. A GA-BBN hybrid model was used to classify positive and negative regions for masses depicted in digitized mammograms. The results showed that GA achieved the same ratio of feature reduction in comparison with the exhaustive search but reduced the total computation time by a factor of 65. In [113], a hybrid PSO-GA FS system was developed to improve the cancer classification performance and reduce the cost of medical diagnoses. Chakraborty [153] proposed a modified version of PSO using a new fuzzy evaluation.
In [172], different hybridization models were developed using the GA algorithm with different neural classifiers to get the best feature subset while preserving accuracy. A comparison was conducted between GA-KNN, GA-BP-NN, GA-RBF-NN, and GA-LQV-NN. The results showed that GA with neural classifiers were more robust and effective. In [173], Babaoglu investigated the effectiveness of both BPSO and GA as FS models for determining the existence of Coronary Artery Disease (CAD). BPSO-SVM and GA-SVM were applied on a data set obtained from patients who had performed Exercise Stress Testing (EST) and coronary angiography. The results showed that the BPSO-FS method was more successful than GA-FS and SVM on determining CAD. An automatic breast cancer diagnosis framework was designed by Ahmad [174]. The developed hybrid Genetic Algorithm Multilayer Perceptron (GA-MLP) model performed simultaneous FS and parameter optimization of ANN.
Three different variations of the backpropagation training algorithm, namely the resilient backpropagation (GAANN-RP), Levenberg Marquardt (GAANN-LM), and Gradient Descent with momentum (GAANN-GD) were investigated. The Wisconsin Breast Cancer Database (WBCD) was used. The experiments showed that the best accuracy was achieved by the RP. Sheikhpour developed a hybrid model to distinguish between benign and malignant breast tumors [175].
PSO-KDE was used to minimize the kernel density estimation error and avoid the time needed by the surgical biopsy. The Wisconsin Breast Cancer Data set (WBCD) and Wisconsin Diagnosis Breast Cancer Database (WDBC) were used. Sayed [176] developed an automatic system based on MFO for Alzheimer's Disease (AD) diagnosis. It was able to distinguish three classes: Normal, AD, and Cognitive Impairment. A benchmark data set of 20 patients from the National Alzheimer's Coordinating Center (NACC) was used. Experiments showed that the SVM-polynomial kernel function was the best one in terms of accuracy, precision, recall, and f-score. A novel medical diagnosis framework based on IGWO and KELM was developed in [119].
The model was investigated on Parkinson's and breast cancer disease data sets. The comparison was performed between IGWO-KELM, GWO-KELM, and GA-KELM. The experimental results showed that the proposed method was better than the other two competitive counterparts. One year later, Sayed developed a new approach for mitosis detection in breast cancer histopathology slide images based on the MFO FS algorithm [177]. MFO was used to extract the best discriminating features of mitosis cells, such as statistical, shape, texture, and energy features, and the selected features were then fed to the Classification and Regression Tree (CART) to classify cells as either mitosis or non-mitosis. Wang [3] developed an efficient medical diagnosis tool based on CMFO and KELM to minimize the number of features and to perform parameter optimization for KELM.

Handwritten Letter Recognition
The target in [178] was to study which of the machine learning algorithms had the right bias to solve specific natural language processing tasks. GA achieved the best results on a language processing WSD data set. In [89], authors developed a hybrid GA to mitigate the weakness of standard GA in fine-tuning near the local minima. The proposed approach was validated using a data set obtained by extracting the gray-mesh features from the CENPARMI handwritten numeral samples. Galbally [149] tried to find a way to minimize the verification error rate in the online signature verification system. A GA-based approach with a new modification was proposed. Experiments were conducted on the MCYT signature database with 330 users and 16,500 signatures. The new approach showed remarkable performance in all the experiments carried out. Zeng [105] developed a novel GA with a dynamic chain-like agent population structure and dynamic neighboring competitive selection strategy. He used a letter-recognition database from UC Irvine (UCI). The experimental results showed that the feature subset generated from CAGA achieved a higher classification rate, more stability, and lower classification complexity in comparison with the other four GAs. A novel FS algorithm based on ACO was presented in [179] to improve the performance of the algorithm in text categorization. Comparisons were conducted with GA, information gain, and Chi Square test (CHI) on the Reuters-21578 data set. The proposed approach proved its superiority on the Reuters-21578 data set. In [129], Principal Component Analysis (PCA) was used with the IG filter method and GA optimizer in a model called IG-GA-PCA. In the first stage, the IG method was applied to rank the terms of the document according to their importance. In the second stage, GA and PCA FS and feature extraction methods were applied separately to the ranked terms. Experiments used both Reuters-21578 and Classic3 data sets.
The experiments showed that the IG-GA-PCA model could achieve high categorization results as measured by precision, recall, and F-measure. In [154], a GA-FLD-based FS approach was used in order to find feature subsets that could optimally discriminate samples from different classes without prior knowledge about feature dimensionality. Another modification based on the fitness function was also proposed. Three standard databases of handwritten digits and one of handwritten letters were used in the experiments. In [53], authors developed a hybrid intelligent algorithm using BPSO and other operators to solve the FS problem in text clustering. A new initialization strategy, new fitness function, and new operator were proposed. The Reuters-21578, Classic4, and WebKB benchmark text data sets were used. The results showed higher clustering accuracy and improved convergence speed for BPSO. Ewees, in [180], introduced a new approach for Arabic handwritten letter recognition (AHLR) called MFO-AHLR. A data set for Arabic handwritten letter images (CENPARMI) was used. Results showed that MFO-AHLR achieved a 99.25% accuracy, which was the highest ratio achieved among all AHLR approaches. Tubishat, in [5], developed a novel hybrid model for Arabic sentiment analysis (SA). The targets of the study were to mitigate the limitations of the WOA such as local minima, slow convergence, diversity, and over-fitting problems. A hybrid model IWOA-SVM-IG was applied over four Arabic benchmark data sets for sentiment analysis. IWOA was compared with six well-known optimization algorithms and two deep learning algorithms, namely Convolution NN (CNN) and Long Short-term Memory (LSTM). The results showed that the IWOA algorithm outperformed all other algorithms.

Hyper Spectral Images Processing
Tackett in [181] worked on extracting the statistical features from a large noisy US Army NVEOD Terrain Board imagery database using GP. In [182], a new model was proposed based on GA, Bayesian classification, and a new proposed fitness function to discriminate the targets from clutters in SAR images. Jarvis [183] developed a novel approach based on GA and DFA for the selection of important discriminatory variables from Fourier Transform Infrared (FT-IR) spectroscopic data. The GA achieved a 16% reduction in the model error. The GA-SVM model for hyper-spectral data classification was proposed in [184]. The proposed GA-SVM was tested on a HYPERION hyper-spectral image. Experiments demonstrated that the number of bands was reduced from 198 to 13, while accuracy increased from 88.81% to 92.51%. A GA-based image annotation system with optimized descriptor weights or/and optimal descriptor subset over MPEG-7 was developed in [150]. The Corel image database, consisting of 2000 images in 20 categories, was used. Experiments showed that the binary-coded GA and the bi-coded GA improved the accuracy of the image annotation system by 7%, 9%, and 13.6%, respectively compared to the commonly used methods.
A new ensemble classifier was proposed in [151]. It was based on AdaBoost and parallel GA in the context of the FS problem for image annotation in the MPEG-7 standard. The experiments were performed over 2000 classified Corel images. In [185], a new approach based on GA, SVM, MI, and BB was developed to search for the best combination of bands in hyper-spectral images. MI was used as a pre-processing step for band grouping based on the correlation between bands and classes. GA-SVM was used to search for the optimal combinations of bands that increase accuracy. A post-processing step based on BB was used to filter out irrelevant band groups. Ghamisi [163] applied the FODPSO SVM approach to determine the most informative bands in the Hekla and Indian Pines hyper-spectral data sets using the parallelism modification technique. In the same year, Ghamisi [186] presented a new hybrid approach based on GA, PSO, and SVM. His target was to detect roads from a background in complex urban images. He integrated the standard velocity and update rules of PSO with selection, crossover, and mutation from GA. In [6], Medjahed developed a novel GWO framework for the Pavia and AVIRIS hyper-spectral image data sets.

Protein and Related Genome Annotation
In [116], a new FS model based on ACO-GA was proposed. Both ACO and GA generated the feature subsets in parallel, then the generated subsets were evaluated by a certain fitness function. ACO used GA operators to update the solutions. The challenging GPCR-PROSITE and ENZYME-PROSITE protein sequence data sets were used. Mandal [157] developed a prediction system to identify the possible subcellular location of a protein based on a multi-objective PSO.

Biochemistry and Drug Design
Raymer [187] developed a system that integrates FE, FS, and classifier training using GA and KNN. This approach was applied in biochemistry and drug design for the identification of favorable water-binding sites on protein surfaces. The approach was validated using protein-water interactions from the biochemistry field. Another model was developed by Salcedo [104]. The proposed FS model was based on GA and the m-features operator (OR operator). The new approach was evaluated using two machine learning classification problems; the first one used two artificial data sets and the second one was a real application in molecular bioactivity for drug design taken from the ones used in the KDD Cup. The m-features operator improved the GA performance over the other existing approaches.

Electroencephalogram (EEG) Application
Palani [188] used GA and Fuzzy ARTMAP (FA) NN for FS. GA-FA-NN was used with the VEP data which was recorded from 10 alcoholics and 10 controls. The target was to classify alcoholics and controls, using multi-channel EEG signals. The discriminatory spectral bands reduced from 7 to 2. The identification of useful spectral power ratios produced better performance. In [189], a hybrid GA-SVM model was used to extract the favorable patterns from noisy multidimensional time series obtained from EEG which are a base for Brain-computer Interfaces (BCIs). The data set was collected by a procedure in which subjects were placed in a dim, sound controlled room. The proposed nonlinear system was better than other linear approaches with a slight difference. A novel ACO-DE FS system called ANTDE was presented in [23]. It could cope with the limitations of ACO regarding the sequential generation for solutions. ANTDE was used in EEG and Myoelectric Control (MEC) biosignal applications. Wang [130] developed a BCI system using a hybrid model GA-SVM-entropy and 28 EEG channels. Noori designed an effective BCI in [190]. He used a new version of GA based on SVM to get smaller optimal features from functional Near-infrared Spectroscopy signals (fNIRS). The experiments were established by recruiting seven subjects who do not have any psychological disorder. Subjects were seated in a quiet room and asked to relax to settle down their responses before beginning to perform mental arithmetic tasks for a certain period.

Financial Prediction
In [191], a new financial prediction model was proposed. A hybrid model SVM-GA was evaluated using 15 business data sets. Each data set consisted of 186 sampled firms. GA-SVM achieved a prediction accuracy of up to 95.56% for all the tested business data. In [192], the authors developed a hybrid fuzzy-GA approach for stock selection. The fuzzy-based scoring mechanism was applied for scoring a set of stocks, then the topmost stocks were selected. GA was applied to perform the dual job of FS and parameter optimization. The constituent stocks of the 200 largest market capitalization listed in the Taiwan Stock Exchange were used in the experiments.

Software Product Line Estimation
Oliveira [22] investigated the use of the GA method for simultaneous FS and parameters optimization of Support Vector Regression (SVR) when applied for software effort estimates. GA, SVR, MLP, and model trees were used. Six benchmark data sets of software projects, namely, Desharnais, NASA, COCOMO, Albrecht, Kemerer and Koten, and Gray were used in the experiments. In [106], Guo presented a new methodology for FS in the software application. The target of the new modified GA was optimizing FS in a Software Product Line (SPLs) to find a feature subset with an optimal product capability subject to feature model constraints and resource constraints. The results showed that GA FS algorithms produced a system with high performance and in 45-99% less time than existing heuristic FS techniques.

Spam Detection in Emails
Temitayo [193] developed a new approach for the classification of emails as either spam or legitimate. GA was used to perform simultaneous FS and parameter optimization. The hybrid GA-SVM spam detection model was evaluated using a Spam Assassin (6000 emails) data set. Experiments showed that GA-SVM improved the results compared with SVM by achieving a higher recognition rate with only a few feature subsets. In [85], a mutation-based BPSO FS model was developed for an email application. A data set of 6000 emails manually collected during the year 2012 was used. The proposed model was able to effectively reduce the false-positive error. In [194], a hybrid GA-RWN was used for identifying the most relevant features in spam emails and for automatic tuning of the hidden neurons. The GA-RWN achieved promising results according to the spam detection rate and optimization of the configuration of its core classifier. Lately, in [195], a novel Northern Bald Ibis Algorithm (NOA) was used with an SVM classifier to get an optimal feature subset of the Enron-spam dataset.

Other Various Applications
Zio [156] developed an efficient transient diagnosis system for nuclear power plants based on GA to select among the several measured plant parameters. In [61], the target was addressing the problem of a lengthy Intrusion Detection (ID) process based on attributes of network packets. Rough-PSO was used and evaluated using the KDDCup 1999 data set. An automatic FS model that can choose the most relevant features from password typing patterns was designed in [196]. The data sets were captured on a Sun Sparc-Station by a program in an X window environment in which the keystroke duration times were measured. Rodrigues in [197] proposed a CS-OPF model for theft detection in power distribution systems. The proposed model was evaluated using two data sets from a Brazilian electrical power company. Experiments proved the robustness of the CS-OPF model by increasing the theft recognition up to 40%. Zhang [127] developed a new forecaster FS model based on combining MI and ACO. The ACO-MI model was applied on forecasters data sets at the Australian Bureau of Meteorology. A system for diagnosing different types of fault in a gearbox was designed in [198]. Hassanien [67] developed an automatic tomato disease detection system based on integrating rough set with MFO.

An Open-Source EvoloPy-FS Framework
EvoloPy-FS [199] is an open-source FS software tool developed by our team and publicly available at www.evo-ml.com. It serves as an explicit white-box NIAs-FS optimization framework. The main objective was to support researchers from different disciplines with an easy-to-use, transparent, and automated NIAs-FS optimization tool. The framework contains several recent NIAs written in Python and a set of different operators such as transfer functions (S-TFs and V-TFs). Moreover, the framework applies wrappers, filters, and a hybrid filter-wrapper, supports different evaluation metrics, and allows data to be loaded from different resources. EvoloPy-FS is a continuation of our path toward building an integrated optimization environment. The work started with EvoloPy [200] for global optimization problems, continued with EvoloPy-NN for optimizing MLP, and most recently led to EvoloPy-FS for optimizing the feature selection process. In [199], the authors based their experiments on 30 well-regarded data sets from common repositories such as UCI and Kaggle. The comparisons were conducted between wrapper FS, filter FS, and hybrid filter-wrapper approaches. It was shown that the wrapper and hybrid filter-wrapper approaches were superior and more reliable in dealing with high-dimensional data sets. However, the filter approach was faster, generating results in a shorter time and with less computational effort.
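To make the role of the transfer functions concrete, the sketch below shows how S-shaped and V-shaped functions are commonly used to turn a continuous position into a binary feature mask. It is a generic illustration and not the EvoloPy-FS code or API: the particular functions chosen (the sigmoid and |tanh|) and the names `s_shaped`, `v_shaped`, `binarize_s`, and `binarize_v` are assumptions made for this example.

```python
import math
import random
from typing import List


def s_shaped(x: float) -> float:
    """Sigmoid transfer function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def v_shaped(x: float) -> float:
    """V-shaped transfer function based on |tanh(x)|."""
    return abs(math.tanh(x))


def binarize_s(position: List[float], rng: random.Random) -> List[int]:
    """S-shaped rule: set each bit to 1 with probability S(x)."""
    return [1 if rng.random() < s_shaped(x) else 0 for x in position]


def binarize_v(position: List[float], current_bits: List[int],
               rng: random.Random) -> List[int]:
    """V-shaped rule: flip the current bit with probability V(x)."""
    return [1 - b if rng.random() < v_shaped(x) else b
            for x, b in zip(position, current_bits)]


rng = random.Random(0)
continuous = [2.5, -3.0, 0.1, -0.2]
mask_s = binarize_s(continuous, rng)
mask_v = binarize_v(continuous, [1, 1, 0, 0], rng)
```

The two families differ in how they treat the current solution: the S-shaped rule overwrites each bit, whereas the V-shaped rule only decides whether to flip it, so a component near zero leaves the bit unchanged.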

Assessment and Evaluation of NIAs FS Modification Techniques
As discussed, NIAs-FS approaches have made substantial contributions and achieved clear success in solving the FS problem in different domains. This section presents the results from the analysis of modified NIAs-FS studies. Table 1 summarizes the main studies in the literature that adopted new operators as modification techniques for NIAs-FS. Table 2 summarizes the main studies that adopted the hybridization modification technique. Table 3 summarizes the main studies that adopted the remaining modification techniques. Table 4 summarizes the main modifications applied in the literature to the main NIAs (applied/not applied), and Table 5 summarizes the same modifications by numbers. Table ?? summarizes the main studies that applied modified NIAs-FS in applications, and Table 7 summarizes the modifications applied to NIAs-FS in the main applications.
It was observed that 34 different operators were applied to NIA wrappers in 48 different papers. Some references adopted more than one operator in their work. The most applied operator is the chaotic map, which was applied in 10 references, followed by rough set in 6 references, selection operators (RWS, TS) in 5, and S-shaped and V-shaped transfer functions and crossover in 4 references. Mutation was applied in 3 references, and the UC, DE, and local search operators were each adopted by 2 references. Each of the remaining operators was adopted by a single reference. It was found that the PSO wrapper was the optimizer most often modified with newly adopted operators for tackling the FS problem: it was modified using a new operator in 21 references. In addition, GA was modified in 6 references, WOA in 4, CS in 3, and SSA in 3, while GWO, GOA, and MFO were each modified by a new operator in 2 references. For DA, FFA, LA, BA, MVO, ABC, CSO, DE, and CSA, the number of references was 1. For FPA and ACO, no work applied new operators to these algorithms for solving the FS problem.
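As an illustration of the chaotic map operator, the sketch below generates a chaotic sequence with the logistic map, x_{k+1} = r x_k (1 - x_k). This is a minimal example under stated assumptions: the logistic map with r = 4 is only one well-known chaotic map and is not necessarily the one used in the surveyed works, and the function name `logistic_map` is hypothetical.

```python
from typing import List


def logistic_map(x0: float, n: int, r: float = 4.0) -> List[float]:
    """Return n successive values of the logistic map started from x0.

    With r = 4 and x0 in (0, 1), the values stay in [0, 1] and can stand in
    for uniform random numbers inside an optimizer's update rules.
    """
    if not 0.0 < x0 < 1.0:
        raise ValueError("x0 must lie strictly between 0 and 1")
    values = []
    x = x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        values.append(x)
    return values


# Two nearby starting points diverge quickly, which is the property that
# chaotic operators exploit to diversify the search.
seq_a = logistic_map(0.7, 50)
seq_b = logistic_map(0.7000001, 50)
```

In a chaotic variant of an NIA, such a sequence typically replaces the pseudo-random coefficients of the position update, or is used to initialize the population.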
The analysis showed that the hybridization modification technique was applied in 75 references to solve FS. This count shows that hybridization is the most widely applied modification technique for enhancing NIA wrappers in the FS domain. Much of this high number comes from GA, which is the NIA with the largest number of works on wrapper hybridization. GA was hybridized in 38 different works, which is much higher than 6, the number of works that adopted new operators for GA. These counts, together with the contributions of the studies, suggest that hybridization is the most suitable modification technique for GA. ACO was also hybridized in 7 references, while no work adopted a new operator to modify ACO. Conversely, there were 11 PSO hybridization works, which is fewer than the 21 works with new operators; by the same reasoning, adopting a new operator for PSO appears more suitable than hybridization. It was also noticed that hybridization using different kinds of classifiers was the most prominent hybridization technique.
There are 49 studies that investigated the influence of the classification technique on the performance of wrappers for optimizing FS. Some of these studies applied simultaneous optimization of FS and a classification/prediction task by tuning the parameters of the classifier while applying FS. The next most widespread hybridization technique is the filter-wrapper, which was applied in 14 studies and was very effective in dealing with high-dimensional feature spaces. Hybridization techniques that tried to balance exploration and exploitation of the search space were also adopted by a considerable number of works. In summary, PSO and GA are the most widely modified NIAs-FS approaches. They were equally modified and used: each of them was adopted and modified for FS in 56 references of the gathered studies.
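The filter-wrapper idea can be sketched in two stages: a cheap filter first shrinks the feature space, and a wrapper then searches only among the surviving features. The example below is a toy illustration under stated assumptions and does not reproduce any surveyed method: absolute Pearson correlation is the assumed filter score, leave-one-out 1-NN accuracy is the assumed wrapper evaluator, an exhaustive search stands in for the NIA, and the fitness weighting `alpha` and all function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; returns 0.0 for a constant input."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def filter_stage(X, y, k):
    """Keep the k features most correlated with the label."""
    n_features = len(X[0])
    scores = [abs_pearson([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def loo_1nn_accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, row in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, other in enumerate(X):
            if i == j:
                continue
            d = sum((row[f] - other[f]) ** 2 for f in subset)
            if d < best_d:
                best_j, best_d = j, d
        correct += int(y[best_j] == y[i])
    return correct / len(X)


def wrapper_stage(X, y, candidates, alpha=0.99):
    """Exhaustive search over candidate subsets (stand-in for an NIA).

    Assumed fitness: alpha * accuracy + (1 - alpha) * fraction of
    candidate features left out, so smaller subsets win ties.
    """
    best_subset, best_fit = None, -1.0
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            acc = loo_1nn_accuracy(X, y, subset)
            fit = alpha * acc + (1 - alpha) * (1 - size / len(candidates))
            if fit > best_fit:
                best_subset, best_fit = subset, fit
    return list(best_subset), best_fit


# Toy data: feature 0 separates the classes, feature 2 is constant.
X = [[0.0, 5.0, 1.0, 0.3], [0.1, 1.0, 1.0, 0.9], [0.2, 4.0, 1.0, 0.1],
     [1.0, 2.0, 1.0, 0.8], [1.1, 5.0, 1.0, 0.2], [1.2, 1.0, 1.0, 0.7]]
y = [0, 0, 0, 1, 1, 1]
candidates = filter_stage(X, y, k=2)
selected, fitness = wrapper_stage(X, y, candidates)
```

Because the wrapper only evaluates subsets of the filtered candidates, the number of classifier evaluations depends on k rather than on the original dimensionality, which is consistent with the reported effectiveness of filter-wrapper hybrids on high-dimensional data.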
Regarding applications of NIAs-FS, it was evident that microarray gene expression classification is the most dominant application: NIAs-FS approaches were applied to it in 18 studies, a ratio of 24% relative to the other applications. The medical application was the second most prominent, with a ratio of 21%. It covers different medical branches: SONAR, tumor, mass, and various disease detection, medical diagnosis, medical data, and bio-signal analysis. These are followed by hyper-spectral imaging with a ratio of 17%, Arabic handwritten recognition with 13%, facial expression recognition with 9%, EEG applications with 7%, financial diagnosis with 5%, and spam detection with 4%. Furthermore, GA is the most dominant NIA optimizer for optimizing FS in applications, with a ratio of 45%. PSO is the second most widespread optimizer with a ratio of 26%, followed by ACO with 11%, MFO with 7%, GWO with 6%, CS with 2%, WOA with 2%, and GOA with 1%.

Conclusions and Future Research Directions
In this study, a survey of modifications of NIAs for tackling the FS optimization problem is presented. The review is based on a solid theoretical, applied, and technical foundation. Three main research streams are identified in this review: meta-heuristic optimization, feature selection, and modifications of NIAs for tackling FS. This review aims to draw a map for researchers and guide them when starting new research in this area. The survey is based on 156 articles collected and studied on modifications of NIAs for solving the FS problem. The articles were gathered mainly from six well-regarded scientific databases: Elsevier, Springer, Hindawi, ACM, World Scientific, and IEEE. The review shows that NIAs have been extensively investigated over the past years to address the FS problem. About 34 different operators were investigated. The most popular operator is chaotic maps. Hybridization is the most widely used modification technique. There are three types of hybridization: integrating NIA with another NIA, integrating NIA with a classifier, and integrating NIA with a filter method (filter-wrapper). The most widely used hybridization is the one that integrates a classifier with the NIA. Microarray and medical applications are the dominant applications where most of the NIAs-FS are modified and used. Despite the popularity of the NIAs-FS, there are still many areas that need further investigation:
• Until now, there have been few works in the binary optimization field. Many new operators can be proposed to enhance the performance of binary optimizers in a binary space. This is an interesting research direction;
• The proposed enhanced binary versions of optimizers can be used as a data mining tool in various applications. There are some applications in which the use of modified NIAs-FS is still limited;
• It would also be interesting to look at the dimensionality and number of instances in data sets. Nowadays, the majority of FS works address problems with dimensionality up to several thousand features, but what will happen if data sets scale up to millions of features? There is a scalability gap that should be addressed in the future;
• There is still room for improvement through parallel NIAs-FS. This might be a fruitful direction for research;
• Hypervolume, Pareto-optimal dominance, and many-objective optimization need further investigation.
The above trends indicate the size of the NIAs-FS research area. Moreover, a thorough investigation and improvement of NIAs can be expected to improve the FS process in various high-dimensional areas. This review paper is intended to give researchers a clear view of the modification strategies in nature-inspired algorithms for tackling the feature selection problem.