Improved Binary Grasshopper Optimization Algorithm for Feature Selection Problem

The grasshopper optimization algorithm (GOA) is inspired by the migration and predation behavior of grasshoppers and can be applied to practical problems. The binary grasshopper optimization algorithm (BGOA) handles binary problems. To improve the algorithm's exploration capability and solution quality, this paper modifies the step size in BGOA: the step size is expanded, and three new transfer functions are proposed based on this improvement. To demonstrate the effectiveness of the algorithm, a comparative experiment with BGOA, binary particle swarm optimization (BPSO), and the binary gray wolf optimizer (BGWO) is conducted. The improved algorithm is tested on 23 benchmark test functions, and the Wilcoxon rank-sum and Friedman tests are used to verify its validity. The results indicate that the optimized algorithm significantly outperforms the others on most functions. On the application side, this paper selects 23 UCI datasets for feature selection. The improved algorithm yields higher accuracy and fewer selected features.


Introduction
Recent years have witnessed rapid development in informatics, and the data scale of applications such as statistical analysis and data mining keeps growing. Accordingly, the number of features obtained from datasets is also increasing. However, some features may be irrelevant or redundant, independent of the final classification goal [1]. Therefore, it is necessary to reduce the dimensionality of the data and obtain representative features before the classification task. Data preprocessing can smooth noisy and incomplete data, detect redundancy, and improve robustness. As an essential preprocessing step, feature selection (FS) can effectively clean out useless data features [2]. Thus, FS plays an essential role in dimensionality reduction and in improving classification performance.
FS is an effective strategy to reduce dimensionality and eliminate noisy and unreliable data [3]. It refers to finding feature-related subsets from a large set of attributes. There are 2^N − 1 possible non-empty feature subsets in a dataset with N features. Davies proved that searching for the smallest subset of features is an NP-hard problem, which means there is no guarantee of finding an optimal solution other than by exhaustive search [4,5]. However, when the number of features is large, exhaustive search cannot be applied in practice because of the enormous amount of computation. Therefore, researchers are committed to using heuristic search algorithms to find suboptimal solutions. Many studies have attempted to model feature selection as a combinatorial optimization problem. The objective function can be classification accuracy or some other criterion that seeks the best trade-off between the number of extracted features and efficiency [6].
Meta-heuristic algorithms are used to find optimal or satisfactory solutions to complex optimization problems [7][8][9]. The principles of these optimization algorithms are derived from knowledge of relevant behaviors and experiences in biological, physical, and other systems. In 1991, an Italian scholar proposed the theory of ant colony optimization (ACO) [10]; since then, swarm intelligence has been formally established as a theory. Swarm intelligence exploits group information and has been extensively used in optimization problems. In 1995, the particle swarm optimization (PSO) algorithm was presented [11], and research on this subject advanced rapidly. The cat swarm optimization based on feline predation strategies was introduced in 2006 [12]. In 2010, fish migration optimization (FMO) emerged, which integrated migration and swimming models into the optimization process [13]. In 2017, Saremi et al. proposed the grasshopper optimization algorithm (GOA) [14]. GOA solves optimization problems by mathematically modeling and simulating the behavior of grasshopper swarms in nature. Compared with other existing algorithms, GOA has higher search efficiency and faster convergence speed. It has also been used to solve continuous structural problems such as finding the best shape for 52-bar and 3-bar trusses. Over recent years, more sophisticated algorithms have been put forward, such as the sparrow search algorithm (SSA) [15], seagull optimization algorithm (SOA) [16], quasi-affine transformation evolution with external archive (QUATRE-EAR) [17], and polar bear optimization algorithm (PBO) [18].
Nonetheless, many optimization problems are discrete, such as FS. Conventional continuous methods cannot satisfy practical needs, so binary algorithms are required. To date, scholars have proposed many binary algorithms and achieved quite fruitful results. Among them, the well-known PSO algorithm and its binary variants have been applied to feature selection [19][20][21][22]. A binary whale optimization algorithm was presented to handle discrete problems [23]. The binary fish migration optimization algorithm (ABFMO) [24] and the improved binary symbiotic organism search algorithm (IBSOS) using transfer functions also solved the FS problem [25]. Likewise, the pigeon-inspired optimization algorithm (PIO) and the gray wolf optimization algorithm (GWO) were improved for better application in feature selection [26][27][28]. PIO simulates the homing behavior of pigeons. Based on the binary pigeon-inspired optimization algorithm (BPIO), Tian et al. proposed an improved binary pigeon-inspired optimization (IBPIO) [29]; they offered a new speed update equation and achieved excellent results. Additionally, the binary approach enabled GWO to be applied to discrete problems [30,31]. A novel binary gray wolf optimization algorithm (BGWO) added a new parameter update equation to enhance the search capability [32]; its authors also gave five transfer functions for feature selection on UCI datasets. Beyond that, a binary version of GOA was also used to solve the FS problem [33]. Hichem et al. proposed a novel binary GOA (NBGOA) by modeling position vectors as binary vectors [34]. Pinto et al. [35] developed a binary GOA based on the percentile concept for solving the Multidimensional Knapsack Problem (MKP). Moreover, BGOA-M, a binary GOA based on the mutation operator, was introduced for the FS problem [36].
The sigmoid transfer function is a common tool for converting algorithms to binary versions [37,38]. Some scholars suggested an improved binary EO (BEO) for FS problems using the sigmoid function [39]. Others presented a binary MPA (BMPA) and its improved versions using the sigmoid and eight other transfer functions [6]. In the binary grasshopper optimization algorithm (BGOA), the authors used the sigmoid transfer function to convert the continuous space to a binary one [36], and it has been applied successfully in feature selection. However, the original BGOA has a weakness: the position conversion probability covers only a small range, which cannot satisfy the exploration requirement of the algorithm. Thus, this paper presents an improved BGOA to avoid this situation. First, the improved BGOA optimizes the step size of the original BGOA. Second, two sigmoid-based and one V-shaped transfer function are proposed based on the new step size. To evaluate the effectiveness of the improved algorithm, 23 well-known datasets are used for experiments, and the improved BGOA is compared with BGOA, BPSO, and BGWO. Experiments show that the proposed algorithm outperforms the original BGOA on the FS problem. The main contributions are as follows:
1. The range of the step size variable in the original BGOA is optimized.
2. Three new transfer functions and two position conversion formulas are proposed based on the new step size.
3. The efficiency of the improved algorithm is examined by several experiments on 23 benchmark functions [40].
4. The improved algorithm achieves satisfactory results in the feature selection application.
The rest of this paper is organized as follows. Section 2 presents the preliminaries, covering GOA and the original BGOA. Section 3 presents the improved version of BGOA. Section 4 shows the performance of the improved BGOA on 23 benchmark functions. Section 5 describes the application of the improved BGOA to feature selection. Section 6 analyzes the feature selection results. Section 7 concludes with a discussion.

Preliminaries
GOA has been maturely applied to continuity problems. Its binary variants have also been gradually refined. This section introduces the standard GOA and the BGOA based on the sigmoid transfer function.

GOA
Grasshoppers are incompletely metamorphosed insects with three life stages: egg, nymph, and adult. They are a worldwide agricultural pest and generally occur individually. Nevertheless, they are swarming organisms prone to periodic population outbreaks and can migrate over long distances. Grasshoppers are usually observed in the nymph and adult stages. Adult grasshoppers have strong hind legs and cause tremendous damage to agriculture, forestry, and animal husbandry. They are adept at jumping and flying, with a wide range of movement. In addition to migration, grasshoppers are also characterized by their predation process. Nature-inspired optimization algorithms have two phases: exploration and exploitation. Exploration is a large-scale search that prevents falling into a local optimum, while exploitation is a small-scale search that refines the solution [41,42]. Grasshoppers instinctively perform these two steps to find their target. Furthermore, based on the grasshopper's characteristics, GOA has a unique adaptive mechanism that effectively regulates the global and local search process with high search accuracy. This phenomenon is mathematically modeled by Saremi et al. [43] as

X_i = S_i + G_i + A_i, (1)

where X_i represents the position of the i-th grasshopper, S_i represents the social interaction between individuals, G_i is the gravitational influence, and A_i is the wind influence. Each operator is multiplied by a random number from 0 to 1 to enhance randomness, as shown in Equation (2):

X_i = r_1 S_i + r_2 G_i + r_3 A_i. (2)

The social interaction operator S_i is defined as

S_i = Σ_{j=1, j≠i}^{N} s(d_{ij}) d̂_{ij}, (3)

where d_{ij} = |x_j − x_i| is the distance between the i-th and j-th grasshoppers, the function s calculates the intensity of the social interaction, and d̂_{ij} = (x_j − x_i)/d_{ij} is the unit vector from the i-th to the j-th grasshopper, with x_i and x_j the positions of the i-th and j-th grasshoppers, respectively.
The s function is defined as shown in Equation (4):

s(r) = f e^{−r/l} − e^{−r}, (4)

where f is the intensity of attraction and l is the attractive length scale. A negative value of s indicates mutual repulsion, a positive value indicates mutual attraction between grasshoppers, and 0 means they are in their comfort zone. The value of f is 0.5 and the value of l is 1.5. When two grasshoppers are too far apart, the force effectively vanishes, so the distance has to be normalized. The original paper does not take gravity into account, and the wind direction is assumed to point toward the best solution found so far. The final position update formula is shown in Equation (5):

X_i^d = c ( Σ_{j=1, j≠i}^{N} c ((ub_d − lb_d)/2) s(|x_j^d − x_i^d|) (x_j − x_i)/d_{ij} ) + T̂_d. (5)

The pseudocode of GOA is given in Algorithm 1.
Algorithm 1 Pseudocode of GOA
1: Initialize C_max, C_min (the two extreme values of parameter c), Max_iter (maximum number of iterations) and N (population size)
2: Initialize the position of each grasshopper X_i (i = 1, 2, ..., N)
3: Set the best solution as Target
4: while t ≤ Max_iter do
5:   Update c with Equation (6)
6:   for each agent do
7:     Normalize the distances between individuals to [1, 4]
8:     Update X_i using Equation (5)
9:     Update Target if a better value is obtained
10:  end for
11:  t = t + 1
12: end while
13: Output Target

In Equation (5), ub_d and lb_d are the upper and lower bounds of the d-th dimension, T̂_d is the d-th component of the best solution found so far, and N is the population size. The parameter c in Equation (5) is computed as

c = C_max − t (C_max − C_min)/T, (6)

where C_max is the maximum value, C_min is the minimum value, t is the current iteration, and T is the total number of iterations. Clearly, c becomes smaller as the number of iterations increases. The outer c narrows the search area around the target as iterations proceed, while the inner c reduces the attraction or repulsion between grasshoppers. In this paper, C_max = 1 and C_min = 0.00001. From Equation (5), the new position of the i-th grasshopper depends not only on its current position but also on the current positions of all other grasshoppers and the interaction forces between individuals. This adaptive mechanism balances global and local search and gives the algorithm an excellent optimization-seeking ability.
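To make the update rule concrete, the following is a minimal Python sketch of the GOA loop described above. It is not the authors' implementation: the normalization of distances to [1, 4] is approximated with a simple modulo mapping, gravity is omitted (as in the original paper), and all parameter names are illustrative.

```python
import numpy as np

def s_func(r, f=0.5, l=1.5):
    # Social force (Equation (4)): positive values attract, negative repel.
    return f * np.exp(-r / l) - np.exp(-r)

def goa_minimize(obj, lb, ub, dim, n=20, iters=50, cmax=1.0, cmin=1e-5, seed=0):
    # Minimal GOA loop following Equations (3)-(6); scalar lb/ub for simplicity.
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(n, dim))
    fit = np.apply_along_axis(obj, 1, X)
    target = X[np.argmin(fit)].copy()        # best-so-far solution (Target)
    target_fit = float(fit.min())
    for t in range(1, iters + 1):
        c = cmax - t * (cmax - cmin) / iters   # Equation (6): shrinking coefficient
        X_new = np.empty_like(X)
        for i in range(n):
            social = np.zeros(dim)
            for j in range(n):
                if j == i:
                    continue
                diff = X[j] - X[i]
                dist = float(np.linalg.norm(diff)) + 1e-12
                unit = diff / dist
                # Crude distance normalization; the paper maps distances to [1, 4].
                r = 2.0 + dist % 2.0
                social += c * (ub - lb) / 2.0 * s_func(r) * unit
            # Equation (5): scaled social term plus attraction to the Target
            X_new[i] = np.clip(c * social + target, lb, ub)
        X = X_new
        fit = np.apply_along_axis(obj, 1, X)
        if float(fit.min()) < target_fit:      # update Target if improved
            target_fit = float(fit.min())
            target = X[np.argmin(fit)].copy()
    return target, target_fit
```

Because Target tracks the best value seen so far, the returned fitness never exceeds the best fitness of the initial population.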

BGOA
The search space of GOA is continuous, so positions can move freely. In binary space, however, a position can only take the value 0 or 1. Mafarja et al. used the sigmoid transfer function to implement the binary conversion:

T(ΔX_t) = 1/(1 + e^{−ΔX_t}), (7)

where ΔX_t is the first part of Equation (5) (the summation term), similar to the velocity variable in the PSO algorithm, and is called the step size. The absolute value of ΔX_t can be regarded as the distance between the updated position of the grasshopper and the target position in the d-th dimension. A conversion probability is obtained from the transfer function, and the formula for updating the grasshopper's position is changed accordingly through Equations (7) and (8):

X^d_{t+1} = 1 if r_1 < T(ΔX^d_{t+1}), and 0 otherwise, (8)

where r_1 is a random number in [0, 1] and X^d_{t+1} is the position in the d-th dimension after the t-th iteration.
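The sigmoid conversion above can be sketched in a few lines of Python (an illustrative sketch, not the authors' code):

```python
import numpy as np

def sigmoid_transfer(delta_x):
    # Equation (7): map a step size to a conversion probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-delta_x))

def binary_position(delta_x, rng):
    # Equation (8): each bit becomes 1 when a uniform draw r1 falls
    # below the conversion probability, and 0 otherwise.
    r1 = rng.random(np.shape(delta_x))
    return (r1 < sigmoid_transfer(delta_x)).astype(int)
```

A zero step size yields a probability of exactly 0.5, i.e. an unbiased coin flip for that bit.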

Analysis and the Improvement of Binary Grasshopper Optimization Algorithm
The standard GOA algorithm and its modified versions have been applied to continuous problems and achieved good results. Feature selection can be seen as a binary problem of selecting an appropriate 0/1 string, whose length equals the total number of features in the original dataset: 0 represents an unselected attribute and 1 a selected one. Additionally, the transfer function is a common and classical method for converting a continuous space to a binary one [44].
The original BGOA used the step size and a transfer function for binary conversion and obtained specific results. From the analysis in Section 2, we know that the parameter c, and hence the step size, becomes smaller as the number of iterations increases. After debugging the code and inspecting the values, the step size is found to lie in [−0.3, 0.4], which means the conversion probability always occupies only a small part of [0, 1]; the curve is shown in Figure 1. Beyond that, the parameter r_1 in Equation (8) is a random number, which may not be conducive to position updates in the early exploration stage. Ideally, individuals in the population should be able to flip their binary positions randomly. To avoid this situation, this paper improves the performance of BGOA by modifying the transfer function: a new step size variable and three improved transfer functions are proposed. The first two transfer functions are based on the sigmoid function, and the third is a V-shaped function.
("X t ) The step size is modified to consider both the range of population positions and the uniformity of particles falling around 0.5 to ensure fairness. When the step size takes a value close to 6, the conversion probability is nearly 1. Therefore, increase the step size ∆X to 20 times, and change the range to [−6, 6]. The new transfer functions are proposed based on the new range. These transfer functions have two extremes close to 0 and 1 on [−6, 6], which has strong randomness in converting the binary position. The range on both sides of the 0 point is also evenly distributed. Here we set B to 20.
When a grasshopper updates its position, the transition probability is obtained according to Equation (9), which we refer to as BGOAS1; Equation (10) is called BGOAS2; and Equation (11) is called BGOAV. The new position is then derived according to Equation (12) or Equation (13). From the above description, the new position of a grasshopper depends on the current positions of all grasshoppers and is finally derived from the position conversion probability. Compared with the existing BGOA, the methods proposed in this paper have better exploration ability and randomness. The new transfer functions are shown in the corresponding figure. The pseudocode of the improved BGOA is given in Algorithm 2.

Algorithm 2 Pseudocode of the improved BGOA
1: Initialize C_max, C_min, Max_iter and N
2: Initialize the position of each grasshopper and set the best solution as Target
3: while t ≤ Max_iter do
4:   Update c with Equation (6)
5:   for each agent do
6:     Normalize the distances between individuals to [1, 4]
7:     Calculate the probability using Equation (9), (10), or (11)
8:     Update X_i using Equation (12) or Equation (13)
9:     Update Target if a better value is obtained
10:  end for
11:  t = t + 1
12: end while
13: Output Target
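The exact forms of Equations (9)-(13) are given in the original paper and are not reproduced here. As a purely illustrative sketch (not the authors' definitions), the S-shaped and V-shaped transfer-function families commonly used in binary metaheuristics, together with the typical V-shaped flip rule, look like this:

```python
import numpy as np

def s_shaped(x):
    # Generic S-shaped family: monotone, maps the reals into (0, 1).
    # The bit is set to 1 with this probability (cf. Equation (12)).
    return 1.0 / (1.0 + np.exp(-x))

def v_shaped(x):
    # Generic V-shaped family: symmetric about 0, maps the reals into [0, 1).
    return np.abs(np.tanh(x))

def v_update(bit, delta_x, rng):
    # Typical V-shaped rule (cf. Equation (13)): flip the current bit with
    # probability v_shaped(delta_x), otherwise keep it unchanged.
    return 1 - bit if rng.random() < v_shaped(delta_x) else bit
```

The key difference between the families: an S-shaped function sets bits directly from the probability, while a V-shaped function complements the current bit, so a near-zero step size tends to preserve the current position rather than randomize it.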

Experimental Results
The validity of the new algorithm is verified in this section. There are many excellent benchmark test suites, such as those of the BBOB workshop, which support algorithm developers and practitioners by automating benchmarking experiments for black-box optimization algorithms [45][46][47]. This manuscript uses 23 benchmark test functions to demonstrate the effectiveness of the proposed algorithm. Among them, f1-f7 are unimodal benchmark functions, f8-f13 are multimodal benchmark functions, and f14-f23 are fixed-dimension benchmark functions. The details of each test function are presented in Tables 1-3, where Space is the search space of the population, Dim is the function's dimension, and TM is its theoretical optimum. The parameter settings of all algorithms are given in Table 4. The improved algorithm is compared with BGOA, BPSO, and BGWO. The mean and standard deviation (std) on the test functions are given in Tables 5 and 6. If the improved algorithm performs better than or the same as the original one, the good result is shown in bold font. For example, for f12, the values obtained by BGOAS1, BGOAS2, and BGOAV are smaller than that of BGOA, so the first three values are indicated in bold.
As can be seen, the improved algorithm has an obvious advantage over BGOA, BPSO, and BGWO in the mean fitness values obtained on the first 13 test functions. This indicates that the improved algorithm is more effective at solving high-dimensional problems. On the fixed-dimension functions, the values obtained by the six algorithms are almost the same, which illustrates that the improved strategy is not the most efficient for low-dimensional problems.
Unimodal test functions have only one optimal solution, so they effectively check the convergence rate of the algorithms. Tables 5 and 6 show that the proposed BGOAV outperforms the compared algorithms on all seven unimodal functions: its mean and std are the smallest. On f2, f3, and f6, BGOAS1 and BGOAS2 also obtain the best values among the considered algorithms. Functions f8-f13 are multimodal test functions; they have many local optima and are suitable for testing an algorithm's ability to avoid them. BGOAS1, BGOAS2, and BGOAV perform well on these functions, and BGOAV outperforms the other algorithms in both the mean and standard deviation of the results. On f9 and f11, BGOAS1, BGOAS2, and BGOAV reach the theoretical optimum. On f8 and f12, the proposed methods are closer to the optimum than BPSO and BGWO. Moreover, for f13, the best result is obtained by BGOAV, BPSO, and BGWO. In other words, the proposed step size improvement produces good results and prevents the algorithms from falling into local optima.
Functions f14-f23 are the fixed-dimension functions. It is evident from the results that the mean and standard deviation obtained by all algorithms are almost the same; only on f20 does BGOA obtain a value closer to the theoretical optimum. This shows that on the fixed-dimension functions the new algorithms have no particular advantage over BGOA, BPSO, and BGWO, owing to the low dimensionality and simple structure of these functions, whereas the improved strategies excel at high-dimensional, complex problems.
To judge whether the results of the improved strategies differ from the best results of the other algorithms, the Wilcoxon rank-sum test and the Friedman test were performed at the 5% significance level. The null hypothesis is that there are no significant differences between the algorithms; if the p-value is smaller than 0.05, the null hypothesis is rejected. Table 7 shows that for f1-f4, the p-values obtained by BGOAS1, BGOAV, BGWO, and BPSO are smaller than 0.05, meaning there is a significant difference between BGOAS1, BGOAV, and BGOA. The data in Table 8 show that the p-value is not greater than 0.05 on f1-f3 and f5-f13, which is strong evidence against the null hypothesis and suggests significant differences between these algorithms. This indicates that the new algorithms are superior to BGOA, BPSO, and BGWO on these 12 functions, and it can be argued that the improved methods outperform the compared algorithms overall.
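As an illustration of this testing procedure (with purely hypothetical fitness samples, not the paper's data), the two tests can be run with SciPy:

```python
import numpy as np
from scipy.stats import ranksums, friedmanchisquare

# Hypothetical fitness values from repeated runs of three algorithms
# (illustrative numbers only; smaller is better).
improved = np.array([0.010, 0.020, 0.015, 0.012, 0.018, 0.011, 0.016, 0.014])
baseline = np.array([0.200, 0.250, 0.220, 0.210, 0.240, 0.230, 0.260, 0.220])
third    = np.array([0.100, 0.120, 0.110, 0.130, 0.100, 0.120, 0.110, 0.130])

# Wilcoxon rank-sum test: pairwise comparison of two independent samples.
stat, p = ranksums(improved, baseline)
significant = p < 0.05          # reject the null hypothesis at the 5% level

# Friedman test: joint comparison of three or more matched samples.
fstat, fp = friedmanchisquare(improved, baseline, third)
```

With samples this clearly separated, both tests reject the null hypothesis of no difference.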
It is easy to see that the improved strategies promote the exploration and exploitation of BGOA. Moreover, it heightens the competitiveness of the algorithm in finding optimal solutions to functions. In the next section, this paper applies the improved algorithm to a real problem to study its practicability in FS.

Application of Feature Selection
Feature selection is a major step in the pre-processing part of data mining. It can remove irrelevant and redundant data from the dataset [48]. Researchers usually seek methods that achieve high accuracy with few selected features. In this section, the improved strategies (BGOAS1, BGOAS2, BGOAV) are exploited in feature selection for classification problems. As shown below, the improved strategies obtain better results and yield more accurate subsets of features.
Twenty-three datasets are selected from the UCI machine learning repository [49] for feature selection, each with different attributes and instance data. In addition, this paper uses a wrapper-based method for feature selection. Detailed information on the 23 datasets is given in Table 9.
The K-nearest neighbor (KNN) algorithm is one of the most commonly used classification algorithms in data mining [50]. KNN is a supervised learning method with a simple mechanism: given a testing sample, find the K nearest training samples according to some distance metric, and then use these K "neighbors" to make the prediction. Typically, voting is used, assigning the test sample to the most frequent class among its K neighbors. The distance metric between samples is generally the Euclidean or Manhattan distance [51], both special cases of the Minkowski distance in Equation (14):

L_p(x, y) = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p}, (14)

where p is a constant; when p = 1, L is the Manhattan distance, and when p = 2, L is the Euclidean distance. Here x_i and y_i denote the i-th components of two different instances. The basic idea of cross-validation is to split the original data into training and testing sets [52]: the former trains the model and the latter validates it. K-fold cross-validation divides all samples into K equally sized subsets and traverses them in turn; each time, the current subset serves as the validation set and all remaining samples as the training set. Finally, the average of the K evaluations is taken as the final evaluation criterion of the model. K is usually at most 20; generally, K = 10 is sufficient [53].
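The Minkowski distance of Equation (14), a majority-vote KNN, and a plain K-fold error estimate can be sketched as follows (an illustrative implementation with assumed helper names, not the authors' code; it omits the shuffling and tie-breaking details a production classifier would need):

```python
import numpy as np

def minkowski(x, y, p=2):
    # Equation (14): p = 1 gives the Manhattan distance, p = 2 the Euclidean.
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def knn_predict(X_train, y_train, x, k=3, p=2):
    # Classify x by majority vote among its k nearest training samples.
    d = np.array([minkowski(x, t, p) for t in X_train])
    nearest = np.argsort(d)[:k]
    vals, counts = np.unique(y_train[nearest], return_counts=True)
    return vals[np.argmax(counts)]

def kfold_error(X, y, k_folds=5, k=3):
    # Plain K-fold cross-validation: each fold is the validation set once;
    # the mean error rate over folds is returned.
    idx = np.arange(len(X))
    folds = np.array_split(idx, k_folds)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        wrong = sum(knn_predict(X[train], y[train], X[i], k) != y[i] for i in f)
        errs.append(wrong / len(f))
    return float(np.mean(errs))
```

On two well-separated clusters, this estimator reports a zero error rate, as expected.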
The classification error rate and accuracy are crucial evaluation indicators in classification prediction. This paper uses Equation (15) as the fitness function:

fitness = μ · errate(KNN) + (1 − μ) · SeF / AlF, (15)

where errate(KNN) is the classification error rate after K-fold cross-validation, defined in Equation (16); the parameter μ is usually taken as 0.99; SeF is the number of features in the selected subset; and AlF is the total number of features in the dataset:

errate = Enum / (Enum + Cornum), (16)

where Enum and Cornum are the numbers of incorrectly and correctly classified samples, respectively. Equation (15) shows that the fitness function seeks the combination of features with maximum classification performance and a minimum number of selected features. The task is converted into a minimization problem by using the error rate instead of the classification accuracy and the selected-feature ratio instead of the unselected-feature ratio.
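The fitness computation of Equations (15) and (16) is a one-liner each (a direct transcription of the formulas above, with μ = 0.99 as the default):

```python
def error_rate(enum, cornum):
    # Equation (16): misclassified fraction of all classified samples.
    return enum / (enum + cornum)

def fitness(err, n_selected, n_total, mu=0.99):
    # Equation (15): weighted sum of the classification error rate and
    # the selected-feature ratio; smaller fitness is better.
    return mu * err + (1.0 - mu) * n_selected / n_total
```

For example, an error rate of 0.2 with 5 of 20 features selected yields a fitness of 0.99 · 0.2 + 0.01 · 0.25 = 0.2005.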

Results of Feature Selection
The improved algorithm and the BGOA, BPSO, and BGWO algorithms are applied to feature selection. All population sizes are set to 30, the number of iterations is 100, and each dataset is run 15 times. The value of K in KNN is 10. Table 10 shows the feature selection fitness values, Table 11 records the numbers of selected features, and Table 12 reports the accuracy of the feature selection. The Wilcoxon rank-sum and Friedman tests on the mean accuracy and fitness values are given in Tables 13 and 14. Table 10 shows that the new strategies have clear advantages: the improved strategies obtain better results than the original BGOA on 15 datasets, and on the Air, Astura, Breast, and Segmentation datasets the new strategies also outperform BPSO. Only on 5 datasets (Appendicitis, Breast, Bupa, Diabetes, and Glass) does the original BGOA obtain the best value. The numbers of selected features presented in Table 11 also support the claim that the improved algorithm performs better than the compared algorithms. It is worth mentioning that BGOAS1 selects the smallest feature subset on 14 datasets, and on all 23 datasets the improved algorithm obtains results better than or equal to those of BGOA, BPSO, and BGWO. The accuracy of feature selection is shown in Table 12. BGOAS1 achieves exceptionally high accuracy on 8 datasets, including Balancescale, Bupa, Cloud, Diabetes, and Heartstatlog; compared with the original algorithm, the accuracy is improved by about 3%. Accordingly, BGOAS2 and BGOAV obtain higher accuracy than BGOA on 10 and 6 datasets, respectively; among them, the accuracy on the Vowel dataset reaches 1. On the Air, Appendicitis, Breast, WDBC, and Zoo datasets, the performance of the six algorithms is comparable. From Table 14, the p-values on the WDBC, Bupa, Segmentation, Jain, Vowel, and Sonar datasets are smaller than 0.05.
Therefore, it can be considered that there are significant differences between these algorithms. The results in these tables confirm the validity and feasibility of the improved algorithm.

Discussion
The binary grasshopper optimization algorithm solves discrete problems such as feature selection. This paper presented three improved versions of the binary grasshopper optimization algorithm for feature selection. A new step size variable and three transfer functions were introduced to strengthen the algorithm's exploration capability in binary space. In addition, experiments on 23 benchmark test functions verified the algorithm's feasibility, showing that the improved algorithm performs particularly well on high-dimensional functions. Subsequently, simulation experiments on feature selection were conducted: on 23 UCI datasets, KNN with 10-fold cross-validation was adopted to address the wrapper-based feature selection problem. The improved algorithms are more competitive than the original BGOA, BPSO, and BGWO in terms of fitness values and selected subsets.
It should be noted that the method in this paper has so far been applied only to feature selection; in the future, it could also address other binary combinatorial optimization problems, such as task scheduling and the traveling salesman problem. Apart from that, the excellent benchmarks of the BBOB workshop may be very useful for the further improvement of BGOA, so more in-depth studies using those benchmarks will be conducted. Finally, the improved algorithm does not perform well on low-dimensional functions, and the binary conversion increases the computing time; future work will aim to shorten the running time of the algorithm and improve its ability to solve low-dimensional problems.