Analyzing Physics-Inspired Metaheuristic Algorithms in Feature Selection with K-Nearest-Neighbor

In recent years, feature selection has emerged as a major challenge in machine learning. In this paper, considering the promising performance of metaheuristics on different types of applications, six physics-inspired metaphor algorithms are employed for this problem. To evaluate the capability of dimensionality reduction in these algorithms, six diverse-natured datasets are used. The performance is compared in terms of the average number of features selected (AFS), accuracy, fitness, convergence capabilities, and computational cost. It is found through experiments that the accuracy and fitness of the Equilibrium Optimizer (EO) are comparatively better than the others. Finally, the average rank from the perspective of average fitness, average accuracy, and AFS shows that EO outperforms all other algorithms.


Introduction
Data mining is the process of finding meaningful information or extracting knowledge from large amounts of data. A challenging problem in data mining is dealing with huge data dimensions. When working with data that has a large number of dimensions, even modern computing technology can become a bottleneck [1]. The data-mining process may suffer due to a huge number of dimensions, and it may also require a lot of computing time and space. Traditional machine-learning (ML) methods cannot handle these huge datasets [2]. A dataset is made up of several samples that collectively give information about a specific instance of the problem, and each sample has a variety of attributes or features. In addition to its huge dimensionality, a dataset may contain several superfluous or duplicate attributes, the model may be complex, and the data may include a substantial amount of noise. The best subset of the useful features that contribute to the output is chosen via a pre-processing technique called feature selection (FS) [2]. FS can reduce the training time as well as the huge number of dimensions in the data. Moreover, the model's accuracy is enhanced, the model is simplified, and computing resources are better utilized [3].
The two main FS approaches are wrapper methods and filter methods. The major drawback of the filter methods is that they work independently of the ML classifiers and do not take any input from them [4]. Meanwhile, the wrapper method uses the classifier directly and picks the features using an optimization algorithm [5]. Optimization algorithms provide the advantage of choosing an optimal or nearly optimal subset of features in a reasonable amount of time, as opposed to a conventional exhaustive search. An exhaustive search becomes impractical because it finds the solution by creating all feasible feature subsets (2^m different solutions for m features) [6]. In the literature, optimization algorithms are categorized into several groups, such as evolution-based algorithms, swarm-based algorithms, human-behavior-inspired algorithms, physics-inspired algorithms, etc. [7]. Swarm-based algorithms mimic the collective but decentralized intelligence of living creatures, such as birds [8], wolves [9], whales [10], bacteria [11], etc. Evolutionary algorithms mimic the emergence of the fittest and healthiest individuals over generations.
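To make the cost of exhaustive search concrete, the subset count 2^m − 1 (all non-empty feature subsets) can be verified with a short sketch (Python here, purely for illustration):

```python
from itertools import combinations

def count_subsets(m):
    """Enumerate every non-empty feature subset of m features; there are 2^m - 1."""
    return sum(1 for k in range(1, m + 1) for _ in combinations(range(m), k))

print(count_subsets(10))  # 1023 subsets for just 10 features
```

The count doubles with every added feature, which is why wrapper methods rely on metaheuristic search instead of enumeration.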
In recent years, metaphor-based algorithms have been used extensively to solve FS problems from different domains. Examples include feature selection using Particle Swarm Optimization (PSO) for document clustering [20], the use of a real-valued Grasshopper Optimization Algorithm (GOA) for feature selection [21], hybridization of the Whale Optimization Algorithm (WOA) and Simulated Annealing (SA) for the feature selection problem [22], feature selection for intrusion detection in wireless mesh networks incorporating genetic operators in WOA [23], the incorporation of Levy flight and opposition-based learning in chaotic Cuckoo Search (CS) for feature selection [24], feature selection using Moth Flame Optimization (MFO) [25], feature selection using the Firefly Algorithm (FA) [26], the hybridization of SA with Harris Hawks Optimization (HHO) for the feature selection problem [27], and feature selection using binary Teaching-Learning-Based Optimization (TLBO).
According to the No-Free-Lunch (NFL) theorem [28], no single optimization algorithm can solve every optimization problem while outperforming all other optimization techniques. Because of this, one optimizer can perform better than the others on some problems, but not on all of them. Hence, it is crucial to compare several optimization algorithms on a variety of datasets to find the optimal solution to the feature selection problem. Since there are hundreds of optimization algorithms in the literature, in this study, a few well-known and highly cited physics-inspired algorithms are chosen for this purpose. The rationale is to compare the various metaphors drawn from physics and evaluate their effectiveness. To evaluate the performance of these algorithms, six small-to-large-sized classification datasets are used. The accuracy, convergence, and average fitness of these algorithms are compared. This paper makes the following contributions:

• The main novelty of our paper lies in its comparative analysis of six well-cited physics-inspired metaphor algorithms for the problem of feature selection.
• To the best of our knowledge, this is the first time these physics-inspired algorithms have been compared for this specific problem, and our findings provide valuable insights into their performance.
• Our study also has broader implications for the field of machine learning and data mining, as it helps to shed light on the effectiveness of different optimization algorithms for feature selection.
• Our work contributes to the growing body of research on metaheuristics and their potential applications in machine learning and data mining, and it highlights the potential value of using physics-inspired optimization algorithms for feature selection.
• Additionally, our use of variable-sized classification datasets allows us to assess the applicability of these algorithms on a wide range of problems, making our results more generalizable and applicable to practitioners.
Overall, we believe that our paper represents a significant contribution to the field and has the potential to influence the way practitioners approach the problem of feature selection. The rest of the paper is organized as follows. The methodology is discussed in Section 2. Section 3, namely, the Results and Discussion, covers the results and comparative analysis of all six algorithms, and the concluding remarks are given in Section 4.

Wrapper Method for Feature Selection
For feature selection, we employed a wrapper method. To accomplish their task, wrapper techniques use a learning algorithm that applies a search strategy to explore the space of feasible feature subsets, ranking them according to the quality of their performance with a specific algorithm. In most cases, wrapper approaches outperform filter methods, since the feature selection process is tailored to the specific classification algorithm being employed. On the other hand, wrapper methods are prohibitively time- and resource-intensive for high-dimensional data, since they require evaluating each feature set with the classifier algorithm. Figure 1 depicts the way in which wrapper methods function.
Appl. Sci. 2023, 12, x FOR PEER REVIEW

In this paper, K-Nearest Neighbor (k-NN) is used as the evaluator algorithm. The k-NN method uses a set of K neighbors to determine how an object should be categorized. A positive integer value of K is pre-decided before running the algorithm. To classify a record, the Euclidean distances between the unclassified record and the classified records are determined and ranked.
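The k-NN classification step described above can be sketched as follows (a minimal NumPy version; the paper's experiments use a MATLAB implementation, and the function name here is illustrative):

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=5):
    """Classify one record by ranking Euclidean distances to the labelled
    records and taking a majority vote among the K nearest (sketch)."""
    d = np.linalg.norm(x_train - x_query, axis=1)      # Euclidean distances
    nearest = y_train[np.argsort(d)[:k]]               # labels of the K nearest records
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]                     # majority class
```

A query point close to one cluster of labelled records is assigned that cluster's class.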

Fitness Function
The effectiveness of an optimizer is evaluated by its fitness function. In feature selection, the fitness function depends on the classification error rate and the number of features used for classification. A solution is deemed good if the selected feature subset reduces both the classification error rate and the number of features chosen. The following fitness function is used in this paper [29]:

Fitness = λ × γ_S(D) + (1 − λ) × |S| / |F|

where γ_S(D) is the classification error computed by the classifier, |S| is the reduced number of features in the new subset, |F| is the total number of features in the dataset, and λ ∈ [0, 1] is a factor weighting the importance of the classification performance against the length of the reduced subset.
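A direct translation of this fitness function might look as follows (the weighting λ = 0.99 is an assumed default, commonly used with this formulation, not a value stated in the paper):

```python
def fitness(error_rate, n_selected, n_total, lam=0.99):
    """Weighted fitness: lam * classification error + (1 - lam) * subset-size ratio.
    Lower is better. lam = 0.99 is an assumed weighting, not taken from the paper."""
    return lam * error_rate + (1.0 - lam) * (n_selected / n_total)
```

With λ close to 1, accuracy dominates, and the subset size only breaks ties between solutions of similar error.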

Physics-Inspired Metaphor Algorithms
In this paper, six well-cited physics-inspired metaphor algorithms are employed to solve the problem of feature selection. In this section, the functioning of these algorithms and their position-updating mechanisms are discussed.

Simulated Annealing
Simulated Annealing is a fundamental nature-inspired algorithm that was proposed in 1983 by Kirkpatrick et al. [30]. The source of inspiration behind this algorithm is the annealing process of metals, in which the material starts at a very high temperature and progressively cools down to physically harden. The algorithm involves three main parameters: the starting temperature (T_0), the final temperature (T_f), and the cooling rate (c). The temperature starts very high and is gradually reduced by the cooling rate until it reaches the final temperature. The process is mimicked by randomly generating a candidate solution. The algorithm runs iteratively, and in each iteration, a new solution is generated in the neighborhood of the current solution. The fitness of the current and neighbor solutions is compared, and if the fitness of the new solution is better, the position of the current solution is updated. Moreover, the best solution keeps the best position found so far. The terminating condition of the repetitive process is reaching T_f. In each iteration, T is updated as follows:

T_{t+1} = c × T_t

where c < 1. SA is a global optimization algorithm, because it can explore as well as exploit the search space. Exploration is performed by occasionally replacing the current solution with a worse neighboring solution in early iterations, depending on the value of T and on how much worse the neighbor is. The chance of accepting the worse neighbor is computed using the following equation:

exp(−δ / T) > r

where exp is the exponential function, δ is the fitness difference between the current and neighboring solutions, and r is randomly generated in the range [0, 1].
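The SA loop described above can be sketched as follows for a toy one-dimensional minimization (a continuous variant rather than the paper's binary feature-selection setting; all parameter values are illustrative):

```python
import math
import random

def simulated_annealing(f, x0, t0=100.0, tf=1e-3, c=0.95, step=0.5, seed=1):
    """Minimal SA sketch: geometric cooling T <- c*T, Metropolis acceptance of
    worse neighbors with probability exp(-delta/T)."""
    random.seed(seed)
    x, best, t = x0, x0, t0
    while t > tf:
        cand = x + random.uniform(-step, step)   # neighbor of the current solution
        delta = f(cand) - f(x)
        # always accept improvements; accept worse moves with prob. exp(-delta/T)
        if delta < 0 or math.exp(-delta / t) > random.random():
            x = cand
        if f(x) < f(best):                       # best solution keeps best position
            best = x
        t *= c                                   # geometric cooling schedule
    return best
```

For f(x) = x², the returned solution settles near the optimum at 0 once the temperature is low and the search turns greedy.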

Gravitational Search Algorithm
This algorithm is inspired by Newton's law of gravitation and second law of motion [17]. It treats each candidate solution in the search space as an object whose mass corresponds to its fitness: heavier objects are fitter than lighter ones. The objects attract each other with a gravitational force that causes them to explore the search space. The heaviest object is considered the global best solution. Since heavier objects attract other objects with more force, the whole population ultimately converges toward the heaviest object, i.e., the global best solution. The algorithm comprises a few mathematical equations, expressed below.
Force calculation: The force from an object j on an object i in dimension d is calculated using the following equation:

F_ij^d(t) = G(t) × (M_pi(t) × M_aj(t)) / (R_ij(t) + ε) × (x_j^d(t) − x_i^d(t))

In the above equation, G is the gravitational constant that controls the search accuracy, M_pi is the passive gravitational mass of solution i, M_aj is the active gravitational mass of solution j, the Euclidean distance between solution i and solution j is denoted by R_ij, x^d is the position of a solution in the d-th dimension, and ε is a small constant.
The total force on a solution (mass) in dimension d is calculated by taking a randomly weighted sum of the forces exerted on it by the k best solutions:

F_i^d(t) = Σ_{j ∈ kbest, j ≠ i} rand_j × F_ij^d(t)

Acceleration calculation: Once the total force on a solution in a particular dimension d is calculated, the acceleration of the solution in that dimension can be computed using the following equation:

a_i^d(t) = F_i^d(t) / M_ii(t)

where M_ii is the inertial mass of solution i.

Velocity calculation: Based on the acceleration, the velocity of a solution is computed by adding the acceleration to a random fraction of the previous velocity of that solution:

v_i^d(t+1) = rand_i × v_i^d(t) + a_i^d(t)

Position updating: To update the position of a solution, the updated velocity is simply added to the old position of the solution:

x_i^d(t+1) = x_i^d(t) + v_i^d(t+1)

Gravitational constant updating: To update G, the following relation is used:

G(t) = G_0 × exp(−α × t / t_max)

where G_0 is the initial gravitational constant, and α is a constant. t and t_max represent the current and final iteration numbers.
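One GSA iteration, combining the force, acceleration, velocity, and position steps above, might be sketched as follows (G_0 = 100 and α = 20 are assumed defaults from the GSA literature; for brevity, the force sum runs over all atoms rather than only the kbest subset):

```python
import numpy as np

def gsa_step(x, v, fit, t, t_max, g0=100.0, alpha=20.0, eps=1e-12, rng=None):
    """One GSA iteration (sketch). x: (n, d) positions, v: (n, d) velocities,
    fit: (n,) fitness values with lower = better."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = x.shape
    g = g0 * np.exp(-alpha * t / t_max)          # G(t) = G0 * exp(-alpha * t / t_max)
    m = (fit.max() - fit) / (fit.max() - fit.min() + eps)
    M = m / (m.sum() + eps)                      # normalized masses: best -> largest
    F = np.zeros_like(x)
    for i in range(n):                           # randomly weighted sum of pair forces
        for j in range(n):
            if i == j:
                continue
            r = np.linalg.norm(x[i] - x[j])      # distance R_ij
            F[i] += rng.random() * g * M[i] * M[j] * (x[j] - x[i]) / (r + eps)
    a = F / (M[:, None] + eps)                   # acceleration a = F / M
    v = rng.random((n, d)) * v + a               # v <- rand * v + a
    return x + v, v                              # x <- x + v
```

Note how the inertial mass cancels part of the gravitational mass, so lighter (worse) atoms take larger steps toward heavier ones.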

Sine Cosine Algorithm
The Sine Cosine Algorithm (SCA) [31] has a very distinctive source of inspiration. It uses the sine and cosine functions to update the positions of solutions while searching the space for the global optimum. The position-updating model of this algorithm is very simple and is formulated below:

X_i^{t+1} = X_i^t + r_1 × sin(r_2) × |r_3 × P_i^t − X_i^t|,  if r_4 < 0.5
X_i^{t+1} = X_i^t + r_1 × cos(r_2) × |r_3 × P_i^t − X_i^t|,  if r_4 ≥ 0.5

In the above equation, X_i denotes the position of a solution in the i-th dimension, and P_i denotes the global best solution, namely, the destination solution in the paper. The equation involves a few other variables that are defined below. r_1 is an adaptive parameter that is linearly reduced over the course of iterations. It starts from a prefixed value a and decreases linearly in each iteration:

r_1 = a − t × (a / t_max)

where a is a constant.
• r_2 is randomly generated in the range 0 to 2π.
• r_3 is a random number generated in the range 0 to 2.
• r_4 is a random number generated in the range 0 to 1; based on its value, either the sine function or the cosine function is used to update the position of the current solution.
When multiplied by r_1, the range of values provided by sin(r_2) and cos(r_2) shifts from [−1, 1] to [−2, 2]. Due to the linear decrease of the parameter r_1, the range begins at [−2, 2] and shrinks to [0, 0] over the iterations. The position-updating equation of SCA creates two regions around the destination P: an inner region that promotes exploitation and an outer region that promotes exploration. The inner region is searched when r_1 × sin(r_2) (or r_1 × cos(r_2)) falls within [−1, 1], and the outer region is searched when r_1 × sin(r_2) or r_1 × cos(r_2) gives a value greater than 1 or less than −1.
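The SCA update can be condensed into a few vectorized lines (a = 2 is the constant suggested in the SCA paper; the function name is illustrative):

```python
import numpy as np

def sca_step(x, dest, t, t_max, a=2.0, rng=None):
    """One SCA position update (sketch). x: (n, d) population,
    dest: (d,) best-so-far (destination) solution P."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = x.shape
    r1 = a - t * (a / t_max)                  # linearly decreasing amplitude
    r2 = rng.uniform(0, 2 * np.pi, (n, d))
    r3 = rng.uniform(0, 2, (n, d))
    r4 = rng.random((n, d))
    dist = np.abs(r3 * dest - x)              # |r3 * P - X|
    sin_move = x + r1 * np.sin(r2) * dist     # sine branch (r4 < 0.5)
    cos_move = x + r1 * np.cos(r2) * dist     # cosine branch (r4 >= 0.5)
    return np.where(r4 < 0.5, sin_move, cos_move)
```

Because r_1 reaches 0 at the final iteration, the population stops moving entirely there, which is easy to verify.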

Atom Search Optimization
Atom Search Optimization (ASO) [32], which is inspired by molecular dynamics, has shown tremendous performance on a variety of applications in the literature. Each atom is considered a candidate solution, and its mass is mapped to its fitness: the higher the mass, the fitter the solution. Every atom in the population attracts or repels the other atoms in the search space. The heavier atoms generate more force and pull the lighter atoms rapidly, while the heavier atoms themselves are pulled toward the others slowly due to their mass. The slowly moving atoms create exploitation in the algorithm, because they search more locally, whereas the rapidly moving atoms allow the algorithm to explore the search space through longer and quicker jumps. The algorithm starts with random initialization. In every iteration, the atoms move and accelerate, and the location of the best atom found so far is likewise adjusted. Atomic acceleration is also caused by two other factors: the Lennard-Jones (L-J) potential and constraint forces. The acceleration is used to update the velocity of the solutions (atoms), and the velocity is then added to the previous position to update the current position of the solution. The position-updating mechanism of the algorithm is discussed below.
The population is generated by randomly generating position and velocity vectors for each atom in the population.
The fitness of each solution in the population is computed, and the global best X best is determined.
The mass of each atom is computed from the fitness of the current solution, the best solution, and the worst solution using the following equations:

m_i(t) = exp(−(Fit_i(t) − Fit_best(t)) / (Fit_worst(t) − Fit_best(t)))
M_i(t) = m_i(t) / Σ_{j=1}^{N} m_j(t)

where Fit_best and Fit_worst are the best and worst fitness values in the current population.
The value of K, which denotes the size of the subset of best atoms that exert forces on the others, is computed as:

K(t) = N − (N − 2) × √(t / T)

where N is the size of the population and T is the maximum number of iterations, so that K decreases gradually from N to 2.
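The mass normalization and the shrinking neighbor count K can be sketched together (lower fitness is treated as better, matching a minimization setting; the function name is illustrative):

```python
import numpy as np

def aso_masses_and_k(fit, t, t_max, n):
    """ASO mass normalization and neighbor-count schedule, following the
    equations above. fit: (n,) fitness values, lower = better."""
    best, worst = fit.min(), fit.max()
    m = np.exp(-(fit - best) / (worst - best + 1e-12))   # best atom -> m = 1
    M = m / m.sum()                                      # normalized masses
    k = int(n - (n - 2) * np.sqrt(t / t_max))            # K shrinks from N to 2
    return M, k
```

Early on, every atom feels forces from nearly the whole population (exploration); by the last iteration only the two best atoms exert forces (exploitation).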
The interaction force on an atom from the K best atoms is then calculated as a randomly weighted sum of their pairwise forces, where each weight rand_j is a random number in the range [0, 1]. The constraint force, which binds each atom to the best atom, is computed using the Lagrangian multiplier λ, which in turn is derived from the multiplier weight β. Once the mass, constraint forces, and interaction forces are computed, the acceleration of each atom is obtained from the total force divided by its mass. Using the acceleration, the velocity and position of an atom are updated as follows:

v_i^d(t+1) = rand_i^d × v_i^d(t) + a_i^d(t)
x_i^d(t+1) = x_i^d(t) + v_i^d(t+1)

Henry Gas Solubility Optimization

Henry Gas Solubility Optimization (HGSO) is inspired by Henry's gas law [33], which states: "At a constant temperature, the amount of a given gas that dissolves in a given type and volume of liquid is directly proportional to the partial pressure of that gas in equilibrium with that liquid". In other words, the partial pressure of a gas and its solubility are directly proportional: if one increases, the other increases too. This relation is expressed through the following equation:

S_g = H × P_g

where the gas solubility is denoted by S_g, Henry's constant is denoted by H, and the partial pressure of the gas is represented by P_g. The proportionality constant H is highly dependent on the temperature, as it varies with changes in temperature. In HGSO, each gas particle is considered a candidate solution, and all particles collectively make up the population. Initially, the gas particles (population) are randomly generated, and the particles then update their positions over the course of iterations by exploring and exploiting the search space. HGSO involves the following steps.

Population initialization: A population of N gas particles is randomly generated using the following equation:

X_i(t+1) = X_min + r × (X_max − X_min)

where X_i denotes the initial position of the i-th solution, X_min and X_max are the lower and upper bounds of the problem under consideration, r is a randomly generated real number between 0 and 1, and t is the iteration number. The properties of each search agent, namely Henry's constant H_j(t) for the j-th cluster, the partial pressure P_{i,j}(t) of the i-th particle in the j-th cluster, and the initial constant value C_j for the j-th cluster, are also initialized randomly.
Clustering: This step divides the search agents into K clusters to map different types of gases, where the same types of gases are grouped into a cluster.Therefore, each cluster has the same value of Henry's constant H j .
Fitness evaluation: In this step, each search agent in the j-th cluster is evaluated through the objective function to find the best solution X_{j,best} in the j-th cluster. Once all the clusters are evaluated, the gases are ranked to find the global best particle X_best.

Update Henry's coefficient: The partial pressure of each gas particle changes in each iteration. Therefore, the value of Henry's coefficient H_j is updated using the following equation:

H_j(t+1) = H_j(t) × exp(−C_j × (1/T(t) − 1/T^0)),  T(t) = exp(−t / t_max)

where H_j represents the value of Henry's constant for the j-th cluster, T indicates the temperature, T^0 denotes a reference temperature equivalent to 298.15 K, and t_max represents the maximum number of iterations.

Update solubility: In this step, the solubility S_{i,j} of the i-th particle in the j-th cluster is updated using the following equation:

S_{i,j}(t) = K × H_j(t+1) × P_{i,j}(t)

where K is a constant, and P_{i,j} is the partial pressure of gas i in cluster j.
Update position: The properties computed in the previous steps are used to update the position of the i-th gas particle in the j-th cluster according to the following equation:

X_{i,j}(t+1) = X_{i,j}(t) + F × r_1 × γ × (X_{j,best}(t) − X_{i,j}(t)) + F × r_2 × α × (S_{i,j}(t) × X_best(t) − X_{i,j}(t))

γ = β × exp(−(F_best(t) + ε) / (F_{i,j}(t) + ε))

where the position of the i-th search agent in the j-th cluster is represented by X_{i,j}, the best agent in the j-th cluster is denoted by X_{j,best}, and the global best particle in the entire population is represented by X_best. Moreover, r_1 and r_2 are two random values in the range [0, 1], t is the current iteration, F is a flag used for diversification purposes that changes the direction of the solution, γ indicates the ability of the i-th particle in the j-th cluster to interact with the other agents in its cluster, α represents the influence of the other gases on the i-th particle, β is fixed at β = 1, F_{i,j} is the fitness of the i-th particle in the j-th cluster, and F_best is the fitness of the best particle.

Escape from local optima: To avoid stagnation in local optima, all the particles are evaluated, and the worst N_w agents are selected and reinitialized, where:

N_w = N × (rand × (c_2 − c_1) + c_1)

where N is the population size and rand is a random number in [0, 1]. Moreover, c_1 and c_2 are constants that define the percentage of worst particles.
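The coupled update of Henry's coefficient and solubility can be sketched as follows (K is the solubility constant from the equation above; the constants used in the actual experiments are not restated here):

```python
import numpy as np

def hgso_properties(h, p, c, t, t_max, k=1.0, t0=298.15):
    """Update Henry's coefficient and solubility for one cluster (sketch).
    h: Henry's constant, p: partial pressure, c: cluster constant C_j."""
    temp = np.exp(-t / t_max)                         # temperature decays over iterations
    h_new = h * np.exp(-c * (1.0 / temp - 1.0 / t0))  # Henry's coefficient update
    s = k * h_new * p                                 # solubility S = K * H * P
    return h_new, s
```

As Henry's law requires, solubility stays proportional to partial pressure: doubling p doubles s while H is unchanged.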

Equilibrium Optimizer (EO)
Control volume mass balance models, which are used to estimate both dynamic and equilibrium states, serve as the inspiration for the Equilibrium Optimizer (EO), a recently proposed physics-inspired algorithm [34]. The particles are considered to be the solutions, and their positions map to the concentrations of the particles. This algorithm constructs an equilibrium pool of five reference solutions (the four best-so-far particles and their arithmetic mean) called equilibrium candidates. Each particle updates its position with reference to a randomly selected candidate from the pool. The algorithm is aided by two carefully designed parameters called the exponential term (F) and the generation rate (G). Moreover, a concept of memory saving is used, which allows a solution to update its concentration only if it improves compared to its previous concentration. Exploration, exploitation, and the balance between them are controlled through these parameters, the equilibrium pool, and the generation probability. EO uses a mass-balance equation to describe the conservation of mass within a system. The generic mass-balance equation is given as:

V × dC/dt = Q × C_eq − Q × C + G

where V × dC/dt represents the rate of change of mass in the control volume, Q is the flow rate, the concentration at the equilibrium state is denoted by C_eq, and G mimics the mass generation rate. The ratio Q/V, denoted by λ, is called the turnover rate (i.e., λ = Q/V). Therefore, the above equation can be rewritten as:

dC/dt = λ × (C_eq − C) + G/V

By integrating the above equation, we obtain:

C = C_eq + (C_0 − C_eq) × F + (G / (λ × V)) × (1 − F)

which is used as the updating rule for each particle, where F is calculated as follows:

F = exp(−λ × (t − t_0))

where t_0 and C_0 represent the initial start time and concentration. In this algorithm, each particle is a solution, and its position represents its concentration. The mathematical formulation of EO is discussed in the following steps.
Initialization and function evaluation: The first step is to initialize the particles' concentrations according to the following equation:

X_m^init = X_min + rand_m × (X_max − X_min)

where X_m^init represents the initial concentration of the m-th particle, rand_m is a random vector in [0, 1], and X_max and X_min are the maximum and minimum values of the search range.
Equilibrium pool and candidates: In this algorithm, four equilibrium candidates (good solutions) are determined to guide the other particles and promote exploration. Moreover, a fifth candidate, constructed by taking the arithmetic mean of these four, is also used, which promotes exploitation. These candidates are assembled to form the equilibrium pool, and each particle updates its position with respect to a randomly selected candidate from the pool.
Exponential term (F): This term is used in position updating to balance exploration and exploitation. It is computed as:

F = exp(−λ × (t − t_0))

where t and t_0 are computed by the following equations, respectively:

t = (1 − iter/iter_max)^(a_2 × iter/iter_max)
t_0 = (1/λ) × ln(−a_1 × sign(r − 0.5) × (1 − exp(−λ × t))) + t

In the above equations, a large value of a_1 promotes exploration, and a large value of a_2 promotes exploitation. The term sign(r − 0.5) controls the direction of exploration and exploitation. Substituting t_0 gives:

F = a_1 × sign(r − 0.5) × (exp(−λ × t) − 1)

Generation rate: In this algorithm, this term acts as a solution refiner by taking small steps. The generation rate control parameter reflects the probability of the generation term contributing to the updating process, and the generation probability determines how many particles employ the generation term to readjust their state. The final position update of EO, based on all the above steps, is formulated below:

C(t+1) = C_eq + (C − C_eq) × F + (G / (λ × V)) × (1 − F)
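The final EO update rule can be written almost verbatim (F and G are assumed to be pre-computed per the equations above; V = 1 as in the original formulation, and the function name is illustrative):

```python
import numpy as np

def eo_update(c, c_eq, lam, g, f, v=1.0):
    """Final EO concentration (position) update, term by term.
    c: current solution, c_eq: chosen equilibrium candidate, lam: turnover rate,
    f: exponential term, g: generation rate."""
    return c_eq + (c - c_eq) * f + (g / (lam * v)) * (1.0 - f)
```

The two limiting cases make the exploration/exploitation balance visible: with F = 1 and G = 0 the particle keeps its position, while with F = 0 and G = 0 it jumps directly onto the equilibrium candidate.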

Results and Discussion
In this section, the performance of the previously discussed algorithms is compared on six well-known datasets. All the algorithms are implemented in MATLAB v2019. The experiments are run on a Windows platform with an Intel(R) Core(TM) i7 CPU @ 3.40 GHz and 24 GB RAM.

Datasets
To evaluate the performance of the algorithms, six datasets, namely, breast cancer, German, heart, ionosphere, ovarian cancer, and sonar, are used. To evaluate the performance from different aspects, we have included mixed types of datasets, from small-featured (heart disease) to large-featured (ovarian cancer) and from small-sized (sonar) to large-sized (German) datasets. The details regarding all datasets are given in Table 1.
Breast cancer dataset: In this dataset, features are computed from a digitized image of a fine needle aspirate of a breast mass. The image's cell nuclei are characterized in terms of their appearance and location [36]. The diagnosis, i.e., the response parameter, is binary (M = malignant, B = benign).
German dataset: Prof. Hofmann created the original dataset, which consists of 1000 entries and 20 categorical/symbolic attributes. Each record in this dataset is an individual who has been extended credit by a financial institution. People are ranked as either "good credit risks" or "bad credit risks" based on several factors.
Heart dataset: The Cleveland, Hungary, Switzerland, and Long Beach V databases are the four parts of this 1988 dataset. It has a total of 76 attributes, including the response attribute, but only 14 have been used in any of the published trials. The "target" attribute indicates whether or not the patient has cardiac disease: zero (0) indicates the absence of disease, and 1 indicates its presence.
Ionosphere dataset: The radar equipment that gathered these data is located in Goose Bay, Labrador. The system's total transmitted power is on the order of 6.4 kilowatts, generated via a phased array of sixteen high-frequency antennas. Radar returns showing a clear ionospheric structure are considered to be of high quality, while those that do not are considered "bad" returns, since their signals pass through the ionosphere. All 34 features are continuous.
Ovarian cancer dataset: A total of 216 patients are included in this dataset, 121 of whom have ovarian cancer and 95 of whom do not. Almost 4000 spectroscopic readings are provided for each patient, each expressing a biomarker. Patients are likely to share many genes and biomarkers due to the substantial correlation within high-dimensional biological and genetic datasets.
Sonar dataset: The 111 patterns in this dataset were created by bouncing sonar signals off a metal cylinder at various angles and under various conditions. It also includes 97 patterns obtained from rocks exposed to the same conditions. The transmitted sonar signal is a frequency-modulated chirp that gradually rises in frequency. The dataset includes signals collected at aspect angles spanning 90 degrees for the cylinder and 180 degrees for the rock.

Parameter Settings
No algorithm can perform well without proper parameter tuning; however, finding the right parameters for an algorithm is itself an optimization problem. To get the most out of each algorithm, several combinations of parameters were tried, in addition to the combinations suggested by the authors of these algorithms in their original papers. It was also ensured that the number of iterations or function evaluations remained the same for all algorithms. The parameter settings used for each algorithm are presented in Table 2. To ensure consistency in the results, each algorithm was run 20 times in a row on all datasets. Furthermore, each dataset was split into training and test sets, where 80% of samples were used for training and 20% for testing. For classification, the k-Nearest Neighbor (KNN) algorithm was used. The KNN classifier is a popular choice in wrapper methods due to its straightforward implementation and the fact that, in comparison to other classifiers, it only requires one parameter, k. The value of k was tuned through several experiments, and k = 5 was found to be the most suitable value.
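The 80/20 split used in the experiments can be sketched as follows (the actual MATLAB partitioning routine may differ; the function name and seed are illustrative):

```python
import numpy as np

def split_80_20(x, y, seed=0):
    """Shuffle and split a dataset into 80% training / 20% test samples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))     # random sample order
    cut = int(0.8 * len(x))
    tr, te = idx[:cut], idx[cut:]
    return x[tr], y[tr], x[te], y[te]
```

Repeating this split (with a fresh shuffle) across the 20 runs is what produces the per-run variation summarized by the reported standard deviations.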

Performance Evaluation
To evaluate and compare the performance of each algorithm, different experiments were performed, and the performance in terms of fitness, accuracy, mean feature subsets, and convergence was compared.

Fitness Comparison
The fitness of a solution is the value returned by the objective function for that solution; it measures how good or bad the solution is. The best fitness attained by each algorithm on each dataset is presented in Table 3. As can be seen from the results, EO outperformed all the other algorithms in five cases. However, the performance of HGSO was comparable in some cases (DS1 and DS5) and was even better in the case of DS4. Based on collective performance, EO can be given the first rank and HGSO the second. For a clearer picture and better understanding, the average fitness and standard deviation of each algorithm on all datasets are compared in Figure 2. The averages depict a slightly different picture: the average fitness of EO is no longer at rank 1 on DS1, DS3, and DS6; however, EO retained its rank on DS2 and DS5. Furthermore, EO secured the first rank for DS4, on which HGSO was better in terms of best fitness. On the contrary, HGSO could not maintain its performance. However, ASO maintained its performance on DS1 and outperformed the other algorithms on DS6. It is important to mention here that SA was the worst performer in nearly all cases.

Comparison of Classification Accuracy
In this subsection, the classification accuracy and the average number of features selected by each algorithm are compared. Accuracy is the percentage of correctly classified test samples, and the average features selected (AFS) is the average number of features to which an algorithm reduces the dimensions of a dataset. AFS is computed by averaging the number of features selected by an algorithm over all 20 runs. The best accuracies obtained by each algorithm on all datasets are presented in Table 4. It is evident from the results that EO outperformed all other algorithms on all datasets. However, ASO performed equally well in three cases: DS1, DS5, and DS6. Simulated Annealing, on the other hand, was the worst performer on all datasets. It is worth noting that DS5 was the easiest dataset to classify, as all algorithms except SA attained 100% accuracy on it; SA was the only one with an accuracy below 100% on this dataset. In this analysis, DS2 remained the toughest benchmark, on which even the best performer could not exceed 82% accuracy.
In addition to the best accuracies, the average accuracy over all runs, along with standard deviations, is compared in Figure 3. Although EO outperformed all other algorithms in terms of best accuracy, due to inconsistent performance across runs and a slightly higher standard deviation, it was no longer the best performer on DS1, DS3, and DS6; its performance, however, remained comparable. ASO, on the other hand, managed to outperform EO on DS1 and DS6, and both were equally good on DS2. Moreover, GSA was noticeably the best on DS3, whereas SA was still the worst performer on all datasets.

Finally, the average number of features selected by each algorithm (AFS), along with the standard deviations, is presented in Table 5. As the results show, HGSO and SCA provided the minimum average number of features on two datasets each, whereas GSA and EO gave the minimum AFS on one dataset each. However, if these results are related to the best accuracies, the gain in AFS of a few algorithms does not compensate for their low accuracies. For example, HGSO gave the minimum AFS on DS3, but its accuracy on DS3 was significantly lower than that of EO. Similarly, HGSO also gave the minimum AFS on DS6, but its accuracy was 2.5% lower than that of SCA, whereas the difference in AFS between the two algorithms was just 0.2. Surprisingly, ASO, which gave some good results in terms of best accuracy, was unable to minimize dimensions as well as the other algorithms; it frequently produced double the best AFS provided by any other algorithm, as seen on DS2, DS4, DS5, and DS6.
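The AFS statistic described above, the mean (with sample standard deviation) of the subset sizes over the 20 runs, can be computed as follows; the binary-mask representation of a run's output is an assumption.

```python
import statistics

def afs(run_masks):
    """Average features selected (AFS): mean and sample standard
    deviation of the number of 1-bits in the binary selection masks
    produced by repeated runs of an algorithm."""
    counts = [sum(mask) for mask in run_masks]
    return statistics.mean(counts), statistics.stdev(counts)
```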

Convergence Analysis
Convergence is the arrival at a stable point beyond which the solution stops improving. However, if an algorithm converges in very early iterations at a poor suboptimal point, this is called premature convergence. In this section, we compare these algorithms based on how well they converge, how well they avoid premature convergence, and how quickly they converge. The convergence curves of all the algorithms on each dataset are plotted in Figure 4. First, regarding convergence capability, all algorithms converged within the first half of the iterations on all datasets. As for premature convergence, SA converged prematurely in most cases; SCA on DS1, HGSO on DS3, and GSA on DS5 also converged prematurely. Finally, regarding convergence speed, EO demonstrated very good speed on DS4, DS5, and DS6. In addition, ASO demonstrated good speed on DS1 and DS3, and SCA was better than all the other algorithms on DS2. Ranking these algorithms by convergence capability, EO secured the first rank, SCA was the second best, and ASO managed the third-best position.
The convergence speed of all algorithms was also measured in the average number of seconds. The average computational time, along with standard deviations, is presented in Figure 5. The results show that SA took the least time to converge on all datasets, which is obviously due to its premature convergence; another reason may be that it is a single-solution algorithm. GSA was the second-fastest algorithm on five out of six datasets. However, comparing the computational time of the best performers, EO and ASO, ASO took much less time than EO on all datasets except DS5.
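Producing convergence curves and timing data of the kind discussed above requires recording the best-so-far fitness at every iteration together with the wall-clock time. A minimal sketch is shown below; the `step` callable is a stand-in for one iteration of any real metaheuristic, not part of the paper's algorithms.

```python
import random
import time

def run_with_curve(step, init_fitness, iters=100, seed=0):
    """Drive a minimizing search while recording the best-so-far
    fitness after every iteration (for a convergence curve) and the
    total elapsed wall-clock time (for a timing comparison)."""
    random.seed(seed)
    best = init_fitness
    curve = []
    t0 = time.perf_counter()
    for _ in range(iters):
        best = min(best, step(best))  # keep the better (lower) fitness
        curve.append(best)
    return curve, time.perf_counter() - t0
```

By construction the recorded curve is non-increasing, which is the property convergence plots rely on: a long flat tail early in the run signals premature convergence.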

Overall Performance Analysis
To find the overall best algorithm, we computed the ranks of all algorithms from three perspectives: average fitness, average accuracy, and average features selected (AFS). Once these ranks were determined, the average rank of each algorithm over all datasets was computed. The overall ranks and the average rank of each algorithm on every dataset are presented in Table 6. As the results illustrate, the best average rank, 1.88, was attained by EO. The second-best position was shared by SCA and HGSO. It is important to mention that SA was the worst performer on the list.
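The ranking procedure just described, rank the algorithms under each metric and then average the ranks, can be sketched as follows. The tie-handling and the illustrative scores are assumptions; they are not taken from Table 6.

```python
def rank(scores, higher_is_better):
    """Map each algorithm to its rank (1 = best) under one metric."""
    order = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: i + 1 for i, name in enumerate(order)}

def average_rank(metrics):
    """Average each algorithm's rank over several (scores, higher)
    pairs, e.g. average accuracy (higher is better) and AFS (lower
    is better), as done to produce an overall ranking."""
    names = metrics[0][0].keys()
    per_metric = [rank(scores, higher) for scores, higher in metrics]
    return {n: sum(r[n] for r in per_metric) / len(per_metric)
            for n in names}
```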

Comparison with Other Methods from the Literature
In this section, the top three physics-inspired metaheuristic algorithms are compared with the state of the art. For this comparison, results from other KNN-metaheuristic combinations reported by Elminaam et al. [37] were chosen. The metaheuristics chosen for comparison draw their metaphor inspiration from various sources. For example, the Grey Wolf Optimizer (GWO) and the Whale Optimization Algorithm (WOA) may be classified as mammal-inspired, whereas Moth Flame Optimization (MFO) and the Butterfly Optimization Algorithm (BFO) are insect-inspired. Similarly, Harris Hawk Optimization (HHO) and the Marine Predator Algorithm (MPA) are inspired by preying behavior seen in nature. Additionally, results based on popular ML algorithms such as Naive Bayes, Logistic Regression, Random Forest, Support Vector Machine (SVM), KNN, Decision Tree, and Stochastic Gradient Descent (SGD) are also compared, along with their principal component analysis (PCA)-enhanced versions.
In terms of classification accuracy (Table 7), EO outperformed all the other methods on two out of the three datasets compared. In fact, for the breast cancer and ionosphere datasets, EO was on average 12.75% and 7.12% better, respectively, than the metaheuristics presented in [37]. On the sonar dataset, too, EO and SCA were within 2.5% of the best solution reported in [37]. Additionally, when compared with the ML algorithms, the EO solution for the breast cancer dataset was on average 17.89% better, and an average superiority of 5.68% was seen for EO compared with the PCA-ML methods reported in [38].
The average features selected for the breast cancer, ionosphere, and sonar datasets by the various metaheuristics are reported in Table 8. It can be observed that the feature reductions achieved by the current physics-inspired metaheuristics are much higher. For the breast cancer, ionosphere, and sonar datasets, the average percent feature reduction achieved by the three physics-inspired algorithms was 85.11%, 88.24%, and 84.89%, respectively, whereas for the metaheuristic algorithms from [37], it was only 67.62%, 64.71%, and 67.14%, respectively.
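The percent feature reduction used in this comparison is the share of the original features discarded on average. A one-line sketch, with illustrative values (an AFS of 4.5 out of 30 features) that are not taken from Table 8:

```python
def percent_reduction(afs_value, n_features):
    """Percent feature reduction: 100 * (1 - AFS / total features),
    i.e., the average share of the original features discarded."""
    return 100.0 * (1.0 - afs_value / n_features)
```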
Thus, from the comprehensive comparisons shown so far, it is clear that the current KNN-hybridized physics-inspired metaheuristic algorithms (especially EO, SCA, and HGSO) are superior to those reported in the literature. Moreover, even the solutions of hybridized ML algorithms (for example, those using dimensionality reduction techniques such as PCA) were inferior to the current solutions. This is worth highlighting, since the current wrapper methods are much simpler in terms of computational complexity than the PCA-hybridized ML methods.

Figure 2 .
Figure 2. Average fitness and standard deviation of all algorithms on all selected datasets.

Figure 3 .
Figure 3. Average classification accuracy and standard deviation of all algorithms on all selected datasets.

Figure 4 .
Figure 4. Convergence curves of all algorithms on all selected datasets.

Figure 5 .
Figure 5. Average computational time and standard deviation of all algorithms on all selected datasets.

Table 1 .
Datasets selected for this study.

Table 2 .
Parameter settings of all algorithms.

Table 3 .
Best fitness values of all algorithms.

Table 4 .
Best classification accuracy of all algorithms.

Table 5 .
Mean selected feature subsets of all algorithms and their standard deviations.

Table 6 .
Average rank of all averages (fitness, accuracy, and AFS).