Boosting Arithmetic Optimization Algorithm with Genetic Algorithm Operators for Feature Selection: Case Study on Cox Proportional Hazards Model

Feature selection is a well-known preprocessing procedure, and it is considered a challenging problem in many domains, such as data mining, text mining, medicine, biology, public health, image processing, data clustering, and others. This paper proposes a novel feature selection method, called AOAGA, using an improved metaheuristic optimization method that combines the conventional Arithmetic Optimization Algorithm (AOA) with the Genetic Algorithm (GA) operators. The AOA is a recently proposed optimizer; it has been employed to solve several benchmark and engineering problems and has shown promising performance. The main aim behind the modification of the AOA is to enhance its search strategies. The conventional version suffers from weaknesses in its local search strategy and in the trade-off between the search strategies. The operators of the GA are therefore used to overcome the shortcomings of the conventional AOA. The proposed AOAGA was evaluated with several well-known benchmark datasets, using several standard evaluation criteria, namely accuracy, number of selected features, and fitness function. Finally, the results were compared with state-of-the-art techniques to prove the performance of the proposed AOAGA method. Moreover, to further assess the performance of the proposed AOAGA method, two real-world problems containing gene datasets were used. The findings of this paper illustrate that the proposed AOAGA method finds new best solutions for several test cases, and it achieved promising results compared to other comparative methods published in the literature.


Introduction
Large datasets exist in many real-world applications such as pattern recognition, data mining, signal processing, machine learning, text processing, image processing, and web content classification [1][2][3]. These datasets typically contain a large number of features that are difficult to handle. As a result, the performance of these applications is often degraded by redundant, noisy, and meaningless data [4][5][6].
Researchers use dimensionality reduction to eliminate unimportant and redundant data by mapping the original high-dimensional data into a new lower-dimensional space [7,8]. Dimensionality reduction also makes it possible to visualize and interpret the data and can increase the application's performance [9]. Feature selection is one of the most common strategies used in the dimensionality reduction domain. The goal is to represent, with high precision, the original features of a specific problem domain by an optimal subset of newly selected features [10]. The feature selection process can be executed in the backward or forward direction. The backward selection strategy starts with all features and eliminates one feature at each step (the one whose removal reduces the error the most); this is repeated until any further elimination increases the error [11,12]. The forward selection strategy starts with an empty set and adds, at each stage, the one feature that reduces the error the most, stopping when no further addition meaningfully reduces it [13,14].
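The forward strategy described above can be sketched as a short greedy loop. This is an illustrative sketch with a hypothetical toy error function, not the paper's implementation:

```python
def forward_selection(features, error_fn):
    """Greedy forward selection: start from an empty set, at each stage add
    the feature that reduces the error the most, and stop when no addition
    improves the error any further."""
    selected = []
    best_err = error_fn(selected)
    improved = True
    while improved and len(selected) < len(features):
        improved = False
        for f in (f for f in features if f not in selected):
            err = error_fn(selected + [f])
            if err < best_err:
                best_err, best_f, improved = err, f, True
        if improved:
            selected.append(best_f)
    return selected, best_err

# Hypothetical toy error: only features 'a' and 'c' are informative.
toy_error = lambda s: 1.0 - 0.4 * ('a' in s) - 0.3 * ('c' in s)
subset, err = forward_selection(['a', 'b', 'c', 'd'], toy_error)
```

Backward selection is the mirror image: start from the full set and greedily remove the feature whose removal reduces the error the most.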
In general, there are two types of feature selection methods, filter and wrapper; the critical distinction between them lies in how the subset of features is chosen [15]. Wrapper approaches use a learning method to evaluate each candidate feature subset. However, it is often infeasible to apply a wrapper to a high-dimensional dataset, because deciding on the relevant features requires significant time. Unlike wrapper approaches, the filter strategy does not use a learning technique to select the features. For these reasons, the wrapper is computationally costly and does not scale to large datasets, whereas filter algorithms are much less expensive in terms of computation [16,17].
The feature cost, meaning the cost of obtaining a feature attribute, is a particular concern in machine learning and data mining, and it comes in different types. It can be portrayed in different ways, such as money, time, pain, and measurement cost, to name a few [18][19][20]. In medical diagnosis, it is usually inexpensive and painless to obtain the values of symptom characteristics detected by eye. However, obtaining the values of other diagnostic characteristics often incurs varying costs and risks due to the need to perform several clinical examinations. These expenses are either resources or time for test results, or the patient's physical and psychological burden. For this problem, it is essential to improve the diagnostic effect by choosing the most critical characteristics, but it is also necessary to increase the patient's comfort and save money by choosing low-cost features. In this case, before settling on the selected diagnostic features, a doctor needs to estimate the trade-off between the diagnostic impact and the cost. There are several related examples in real-world implementations. However, most conventional feature selection approaches neglect the question of feature cost [21,22].
When all possible subsets of the dataset are examined during the generation process, the complexity and processing time are very high, on the order of 2^x, where x is the number of features in the dataset [23]. Therefore, researchers have sought to formulate approaches that solve the feature selection (FS) problem and provide solutions more efficiently than conventional techniques. This problem is considered one of the most pressing problems faced by new technology due to the size of the available information and data [24][25][26]. The use of metaheuristic algorithms is one such approach. Metaheuristic algorithms have been applied to many topics in artificial intelligence and have led to many solutions [27][28][29]. They are now widely used to solve feature selection problems [30]. According to the parameters governing the way these algorithms operate, they generate subsets randomly. It has been shown that they can help minimize execution time and produce concrete outcomes. Grey Wolf Optimizer (GWO) [31], Whale Optimization Algorithm [32], Monarch Butterfly Algorithm [23], Coyote Optimization Algorithm [33], Genetic Algorithm [34], Krill Herd Algorithm [35], Harmony Search [36], Aquila Optimizer [37], Particle Swarm Algorithm [38], and Parallel Membrane-inspired Framework [39] are examples of the metaheuristics that have been used to address feature selection problems.
Several such techniques have been published in the literature [40][41][42][43][44]. For example, in [15], the enhancement is carried out by integrating the opposition-based learning methodology and differential evolution with the Moth-Flame Optimization (MFO). To maximize the integration with the MFO, opposition-based learning is used to produce an optimal initial population; meanwhile, differential evolution is applied to boost the MFO's exploitation ability. Therefore, unlike the conventional MFO algorithm, the suggested approach, denoted OMFODE, avoids getting trapped in a local optimum and accelerates convergence. The work in [45] proposes a hybrid solution that incorporates two search methods: GWO and Particle Swarm Optimization (PSO). The GWO is inspired by the leadership hierarchy and the hunting behavior of gray wolves in nature, where gray wolves choose to live in packs. The goal of this hybridization is to combine exploitation and exploration in a balanced manner.
In [23], the novel monarch butterfly optimization (MBO) algorithm is implemented within a wrapper feature selection approach that uses the k-nearest neighbor (KNN) classifier. Tests are presented on eighteen benchmark datasets. The results showed that MBO was superior to four optimization algorithms, providing a high classification accuracy rate. In [46], a hybrid crow search algorithm integrated with chaos theory and the fuzzy c-means technique, designated CFCSA, is developed for feature selection challenges in medical diagnosis and other problems. The crow search algorithm adopts a global optimization methodology in the recommended CFCSA framework to reduce sensitivity to local optima. The fuzzy c-means (FCM) objective function is used as the cost attribute for the chaotic crow search algorithm. Like other optimization algorithms, the Salp Swarm Algorithm (SSA) suffers from low population diversity and tends to get trapped in local optima. The research in [47] provides an improved SSA variant, known as the Dynamic Salp Swarm Algorithm (DSSA), to solve these problems. Two significant changes were included in the SSA to fix its challenges. The first upgrade entails creating a new equation for updating the positions of salps; the use of this new equation is regulated by Singer's chaotic map, and this change aims to increase the diversity of SSA solutions. The second enhancement entails creating a new local search algorithm (LSA) to increase the exploitation of SSA.
As discussed above, optimization algorithms have shown promising results when applied to feature selection problems in recent decades. However, considering the increasing research in this direction, a fundamental question still emerges: do we need further optimization approaches to find better outcomes? In this regard, newly introduced metaheuristic algorithms, derived from arithmetic operators, biological evolution, swarm behavior, physical concepts, and mathematical laws, have been increasingly investigated. However, researchers report that these approaches frequently work ineffectively when there is a substantial increase in complexity and problem dimensionality. This research has two primary motivations: (A) the No-Free-Lunch (NFL) theorem, which states that no optimization technique can solve all optimization problems, so an optimizer's outstanding success on one group of problems does not guarantee equally effective performance on another group. This has inspired many scientists in this area to apply current approaches to new problem groups, and it is the basis and inspiration for this research. We suggest a novel optimization method, integrating the Arithmetic Optimization Algorithm (AOA) with the crossover and mutation operators of the Genetic Algorithm, to solve high-dimensional feature selection problems. This problem can be categorized as hard and cannot be solved easily by a traditional technique, so it needs an advanced and improved method to find the optimal solution for the cases used in this paper. (B) To the best of the authors' knowledge, the proposed method is applied for the first time to solve feature selection problems. The proposed method tackles the conventional AOA's main weaknesses by avoiding the local search problem and balancing the search strategies.
As optimization methods are the best choice to deal with such a complicated problem, we use the proposed method, given its previous performance, with some improvements to efficiently tackle the feature selection problem and find new best solutions. Twenty feature selection datasets are used to prove the proposed method's performance, and the results are compared with other state-of-the-art methods using standard evaluation criteria. The results showed that the proposed method is promising in solving high-dimensional feature selection problems compared to other well-known methods.
The main contributions of this paper are as follows.

1. A modified approach combining the classical AOA and GA is proposed that further enhances the exploration and convergence characteristics of this evolutionary-based wrapper feature selection method through a diverse population design.
2. Boosted mutation and crossover operators are introduced for search-based exploration and exploitation.
3. The inclusion of the GA operators promotes the convergence rate and balances the exploration and exploitation characteristics of the proposed approach.
4. Reducing the feature input set using the proposed search method for high-dimensional problems is conducive to developing a high-performing decision method.
5. The proposed method is compared with several state-of-the-art methods on twenty datasets.
The rest of this paper is organized as follows. Section 2 shows the general methods and the proposed improved algorithm. Then, Section 3 presents the experiments and the discussion of the results. Finally, Section 4 gives the conclusion of this paper and potential future work.

Problem Formulation of FS
In this section, the mathematical formulation of FS is introduced. In general, consider the classification (i.e., supervised learning) of any dataset of size N_S × N_F, where N_S is the number of samples and N_F stands for the number of features. The main objective of the FS problem is to select a subset of features S from the total number of features (N_F), where the size of S is less than N_F. This can be achieved by minimizing the following objective function:

Fit = λ × γ_S + (1 − λ) × (|S| / N_F),

where γ_S refers to the classification error using S, and |S| is the number of selected features. λ is used to balance between (|S| / N_F) and γ_S.
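As a sketch, the objective above can be written as a small helper; the default λ value here is illustrative, not the paper's setting:

```python
def fs_fitness(error_rate, n_selected, n_features, lam=0.99):
    """FS objective to minimize: lam * gamma_S + (1 - lam) * |S| / N_F.
    lam (lambda) trades classification error against subset size."""
    return lam * error_rate + (1 - lam) * n_selected / n_features

# With equal error, the smaller subset wins:
f_small = fs_fitness(error_rate=0.05, n_selected=5,  n_features=100)
f_large = fs_fitness(error_rate=0.05, n_selected=90, n_features=100)
```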
Arithmetic Optimization Algorithm (AOA)
The AOA begins by generating a random initial population of agents. Next, the fitness function of each solution is computed to detect the best one, X_b. Then, depending on the Math Optimizer Accelerated (MOA) value, AOA performs either the exploration or the exploitation process. The MOA is updated as in the following equation:

MOA(t) = Min_MOA + t × (Max_MOA − Min_MOA) / M_t,    (3)

in which t is the current iteration, M_t represents the total number of iterations, and Min_MOA and Max_MOA represent the minimum and maximum values of the accelerated function, respectively. The multiplication (M) and division (D) operators are employed in the exploration phase of the AOA, as presented in the following equation:

x_{i,j}(t + 1) = X_{b,j} ÷ (MOP + ε) × ((UB_j − LB_j) × µ + LB_j),  if r_2 < 0.5,
x_{i,j}(t + 1) = X_{b,j} × MOP × ((UB_j − LB_j) × µ + LB_j),  otherwise,    (4)

in which ε represents a small positive value, LB_j and UB_j are the lower and upper boundaries of the search domain at the jth dimension, and µ = 0.5 represents the control function. Moreover, the Math Optimizer Probability (MOP) can be described as:

MOP(t) = 1 − t^{1/α} / M_t^{1/α},    (5)

where α represents the dynamic parameter that determines the precision of the exploitation phase throughout the iterations. Furthermore, the addition (A) and subtraction (S) operators are used to implement the AOA exploitation phase, using the following equation:

x_{i,j}(t + 1) = X_{b,j} − MOP × ((UB_j − LB_j) × µ + LB_j),  if r_3 < 0.5,
x_{i,j}(t + 1) = X_{b,j} + MOP × ((UB_j − LB_j) × µ + LB_j),  otherwise,    (6)

in which r_3 represents a random number generated in [0,1]. After that, the agents' updating process is implemented using the AOA operators. To sum up, Algorithm 1 illustrates the main steps of the AOA.

Algorithm 1 Steps of AOA
1: Input: the parameters of AOA, such as the dynamic exploitation parameter (α), control function (µ), number of agents (N), and total number of iterations (M_t).
2: Construct the initial values for the agents X_i, i = 1, ..., N.
3: while (t < M_t) do
4:   Compute the fitness function for each agent.
5:   Determine the best agent X_b.
6:   Update the MOA and MOP using Equation (3) and Equation (5), respectively.
7:   for i = 1 to N do
8:     for j = 1 to Dim do
9:       Update the values of r_1, r_2, and r_3.
10:      if r_1 > MOA then
11:        Exploration phase:
12:        Use Equation (4) to update X_i.
13:      else
14:        Exploitation phase:
15:        Use Equation (6) to update X_i.
16:      end if
17:    end for
18:  end for
19:  t = t + 1
20: end while
21: Output: the best agent (feature subset) X_b.
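Assuming the standard AOA update rules summarized above, one inner-loop position update of Algorithm 1 can be sketched as follows; all parameter values here are illustrative defaults, not the paper's settings:

```python
import random

def aoa_update(x_best, lb, ub, t, M_t, alpha=5, mu=0.5, eps=1e-12,
               moa_min=0.2, moa_max=1.0):
    """One AOA position update for a single agent (sketch of the inner
    loop of Algorithm 1, using the standard AOA equations)."""
    moa = moa_min + t * (moa_max - moa_min) / M_t          # Eq. (3)
    mop = 1 - (t ** (1 / alpha)) / (M_t ** (1 / alpha))    # Eq. (5)
    new_x = []
    for j in range(len(x_best)):
        r1, r2, r3 = random.random(), random.random(), random.random()
        scale = (ub[j] - lb[j]) * mu + lb[j]
        if r1 > moa:                                       # exploration, Eq. (4)
            if r2 < 0.5:
                xj = x_best[j] / (mop + eps) * scale       # division
            else:
                xj = x_best[j] * mop * scale               # multiplication
        else:                                              # exploitation, Eq. (6)
            if r3 < 0.5:
                xj = x_best[j] - mop * scale               # subtraction
            else:
                xj = x_best[j] + mop * scale               # addition
        new_x.append(min(max(xj, lb[j]), ub[j]))           # keep within bounds
    return new_x

random.seed(0)
x_new = aoa_update([0.5, 0.5], [0.0, 0.0], [1.0, 1.0], t=50, M_t=100)
```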

Genetic Algorithm
In this section, basic information about the Genetic Algorithm (GA) is introduced [49]. In general, GA is a population-based metaheuristic technique in which each individual inside the population represents a feasible solution. Three stages are used in GA to update the individuals, namely the Selection, Crossover, and Mutation processes. In the Selection process, two individuals are selected, which helps enhance the population's diversity. The Crossover process then generates new individuals from the selected individuals (parents) by exchanging their values. After that, Mutation is applied to replace a randomly selected element with a random value belonging to the search space. Finally, according to the fitness values of the newly generated individuals and their parents, the current population is updated by selecting the best individuals to form the new population. Updating the population using the three processes of GA (i.e., selection, crossover, and mutation) is repeated until the stop conditions are reached.

Crossover
The crossover is one of the basic operators in GA, and different modifications of it exist in the related literature. The simplest crossover is the single-point method. Two parents are randomly selected from the population, and a single point divides the information contained in them; the values after that point are interchanged between the two parents, creating new solutions. Figure 1 graphically shows how the single-point crossover works. The single-point crossover is a good alternative, but for real-coded representations it is better to employ another version. The blend crossover, also known as BLX-α, is a real-coded operator. As with the single-point method, two parents x_1 and x_2 are taken from the population, and a portion x_i^c is extracted from both of them. Equation (7) provides a better explanation of BLX-α:

x_i^c = U(c_min − α × I, c_max + α × I),  with  c_min = min(x_i^1, x_i^2),  c_max = max(x_i^1, x_i^2),  I = c_max − c_min,    (7)

where x_i^1 and x_i^2 are elements taken from x_1 and x_2, U(a, b) denotes a uniform random number in [a, b], and α is a positive value set to 0.5 according to [50].
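A minimal sketch of the BLX-α operator described above; the function name and test values are illustrative:

```python
import random

def blx_alpha(x1, x2, alpha=0.5):
    """BLX-alpha crossover: each offspring gene is drawn uniformly from the
    parents' interval extended by alpha times its width on both sides."""
    child = []
    for a, b in zip(x1, x2):
        lo, hi = min(a, b), max(a, b)
        d = hi - lo
        child.append(random.uniform(lo - alpha * d, hi + alpha * d))
    return child

random.seed(1)
c = blx_alpha([0.2, 0.8], [0.4, 0.4])
```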

Mutation
The mutation is an operator that helps to explore around a specific solution. As with the crossover, there are several ways to perform the mutation; in this article, the Gaussian mutation introduced by Higashi and Iba [51] is considered. In this kind of mutation, an element is taken from the population and modified using a random number drawn from a Gaussian distribution. The modified solution is a mutated individual, and it is computed as follows:

x'_{id} = x_{id} + Gaussian(σ),    (8)

where x_{id} is the selected individual from the population, and Gaussian(σ) is a random number generator that uses a Gaussian distribution with a standard deviation of σ = 0.1.
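The Gaussian mutation can be sketched as follows; the clipping to [0, 1] is an added assumption for the FS setting, where solutions live in that range:

```python
import random

def gaussian_mutation(x, sigma=0.1, lb=0.0, ub=1.0):
    """Gaussian mutation (Higashi & Iba style): perturb each gene with
    N(0, sigma) noise and clip back into the search domain."""
    return [min(max(v + random.gauss(0.0, sigma), lb), ub) for v in x]

random.seed(2)
m = gaussian_mutation([0.5, 0.9, 0.1])
```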

Selection
The selection operator is also important because it extracts the elements of the population that will be manipulated by the crossover and mutation. Different mechanisms exist here as well, but the most common is the roulette wheel [52]. This method is based on the fitness, and it works by assigning a probability p_s to each member of the population; the population is then segmented into different regions represented by the individuals. In a population of n candidate solutions defined as P = {a_1, a_2, . . . , a_n}, where element a_i possesses a fitness value f(a_i), the probability of a_i being selected is computed as:

p_s(a_i) = f(a_i) / Σ_{j=1}^{n} f(a_j).    (9)
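A minimal roulette-wheel sketch, assuming non-negative fitness values where higher is better (in a minimization setting, the fitness would first be inverted):

```python
import random

def roulette_select(population, fitness):
    """Roulette-wheel selection: pick individual a_i with probability
    f(a_i) / sum_j f(a_j), by walking a cumulative sum."""
    total = sum(fitness)
    r = random.uniform(0.0, total)
    acc = 0.0
    for individual, f in zip(population, fitness):
        acc += f
        if r <= acc:
            return individual
    return population[-1]  # guard against floating-point round-off

random.seed(3)
picks = [roulette_select(['a', 'b', 'c'], [1.0, 1.0, 8.0]) for _ in range(1000)]
```

With these weights, 'c' should be picked roughly 80% of the time.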

Proposed AOAGA Feature Selection
Optimization techniques, as mentioned above, have been successfully used in many research fields to solve various complicated problems. In this section, the proposed optimization method is presented to solve the feature selection problem. This problem is a widespread complex issue that appears in many knowledge-based approaches and needs an efficient method to be solved. It is typically based on selecting the optimal features from a massive set of features to reduce the computational time and increase the performance of the underlying system analysis. Figure 2 depicts the structure of the developed feature selection method. This method depends on enhancing the performance of AOA to find the optimal subset of relevant features using the operators of the genetic algorithm (GA). The developed FS method is called AOAGA. The main difference between AOAGA and the original AOA is that the exploration phase of the proposed AOAGA is improved: it can explore more regions of the search domain than the original version of the AOA, and it can also escape from getting stuck in local optima thanks to the operators of the GA.
The AOAGA starts by setting the initial population U, which contains N agents; this is formulated as follows:

u_{i,j} = LB_j + α_i × (UB_j − LB_j),  i = 1, ..., N,    (10)

where α_i ∈ [0, 1] is a random value, and UB_j = 1 and LB_j = 0 are the limits of the search domain. The next step in the developed AOAGA is to assess the quality of the selected features. This is achieved by converting each agent into binary form using the following equation:

BU_{i,j} = 1 if u_{i,j} > 0.5, and BU_{i,j} = 0 otherwise.    (11)
Thereafter, the classification error is computed after removing the irrelevant features corresponding to zeros in BU. This is performed using Equation (12):

Fit_i = λ × γ_i + (1 − λ) × (|BU_i| / N_F),    (12)

where λ ∈ [0, 1] refers to the weight applied to balance the two terms of Equation (12), N_F refers to the number of features, and |BU_i| is the number of selected features (corresponding to ones inside U_i). γ_i is the classification error using the features in U_i, computed based on the KNN classifier. In this study, KNN is trained using a training set representing 80% of the dataset, while the rest (20%) is used as a testing set to evaluate the learned KNN.
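The binary conversion and the fitness evaluation can be sketched together as follows. The error function here is a hypothetical stand-in for the KNN train/test evaluation, and the guard for an empty subset is an added assumption:

```python
def binarize(u, threshold=0.5):
    """Eq.-(11)-style conversion: feature j is selected when u_j > threshold."""
    return [1 if v > threshold else 0 for v in u]

def agent_fitness(u, error_fn, lam=0.99):
    """Eq.-(12)-style fitness: weighted classification error plus the ratio
    of selected features (both minimized). error_fn is assumed to train and
    test a classifier (e.g., KNN on an 80/20 split) on the binary subset."""
    bu = binarize(u)
    n_sel = sum(bu)
    if n_sel == 0:          # an empty subset cannot be evaluated
        return float('inf')
    gamma = error_fn(bu)
    return lam * gamma + (1 - lam) * n_sel / len(u)

# Hypothetical toy error: pretend only the first two features matter.
toy_error = lambda bu: 0.1 if bu[0] and bu[1] else 0.5
fit = agent_fitness([0.9, 0.7, 0.2, 0.4], toy_error)
```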
Thereafter, the best agent U b is determined and used to update the other agents with the operators of GA and AOA. This updating process is performed using Equation (13).
The next step is to check the stop conditions; if they are not met, the updating process is repeated. Otherwise, the best agent is returned, the testing set is updated according to it, and the classification quality is evaluated using the updated testing set. The flowchart of the developed AOAGA is given in Figure 2.
The primary references in this paper were chosen according to their importance and results in this field; we focused on the most related research to support our work and obtain significant results and descriptions. However, a main limitation of this study is that the proposed method has not yet been applied to other real-world feature selection problems for further medical purposes or compared with additional advanced methods published in this domain. Such experiments could further prove the ability of the proposed method to solve various feature selection problems.
The complexity of the developed AOAGA depends on parameters such as the number of agents N, the total number of iterations M_t, and the dimension of the tested problem n. The complexity of AOAGA in terms of Big O can therefore be formulated as:

O(AOAGA) = O(N × n × (M_t1 + M_t2)),    (14)

where M_t1 stands for the number of iterations used to update the solutions using the operators of GA, and M_t2 for the number of iterations using the AOA operators. Since M_t2 = M_t − M_t1, we can rewrite Equation (14) as:

O(AOAGA) = O(N × n × M_t).    (15)

Experimental Results and Discussion
In this study, the developed AOAGA is evaluated in terms of improving classification performance by removing irrelevant features. This was achieved using twenty UCI machine learning repository datasets [53] and real-world datasets from [54,55].
These algorithms are run on an Intel Core i5 processor with 8 GB of RAM using Matlab 2014b. The population size is set to 25, whereas the maximum number of iterations is 100. Thirty independent runs are produced for each algorithm.

Performance Measures
To validate the performance of the developed AOAGA, a set of evaluation metrics is used: accuracy, the number of selected features, and the average and standard deviation of the fitness value [63][64][65]. The definition of each measure is given as follows:

• Average accuracy (Avg_acc) computes the ability of an algorithm to predict the correct label of each class over the runs; a higher value is better [65]. It is defined as:

Avg_acc = (1 / N_r) × Σ_{k=1}^{N_r} Acc_k,  with  Acc = (TP + TN) / (TP + TN + FP + FN),

where N_r is the number of runs, TP and TN refer to the true positives and true negatives, and FP and FN are the false positives and false negatives, respectively [66].

• Standard deviation (STD) checks to what extent an algorithm obtains the same results over different runs; a smaller value is better [65]. It is formulated as:

STD = sqrt( (1 / (N_r − 1)) × Σ_{k=1}^{N_r} (Fit_k − AVG_Fit)^2 ).

• Average number of selected features (AVG_|BX_Best|) tests an algorithm's ability to choose the smallest subset of relevant features over all runs; a smaller value is better [65]. It is given as:

AVG_|BX_Best| = (1 / N_r) × Σ_{k=1}^{N_r} |BX^k_Best|,

where |.| denotes the cardinality of BX^k_Best at the k-th run.

• Average fitness value (AVG_Fit) evaluates the algorithm's ability to balance the lower error and the ratio of selected features; a smaller value is better [65]. It is formulated as:

AVG_Fit = (1 / N_r) × Σ_{k=1}^{N_r} Fit_k.
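The accuracy and the per-run average/STD above can be sketched with the standard library; the confusion-matrix counts and run values are illustrative:

```python
import statistics

def accuracy(tp, tn, fp, fn):
    """Acc = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def avg_and_std(values):
    """Average and sample standard deviation over independent runs."""
    return statistics.mean(values), statistics.stdev(values)

acc = accuracy(tp=40, tn=45, fp=5, fn=10)
avg, std = avg_and_std([0.91, 0.93, 0.92])
```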

Experimental Series 1: UCI Datasets Results and Discussion
Within this experiment, a set of UCI datasets is used. These datasets are collected from different fields such as Biology, Games, Electromagnetics, Politics, Physics, and Chemistry. In addition, each of them has a different number of samples, features, and classes. The description of each dataset is given in Table 1. This subsection presents and discusses the experimental results and comparisons obtained in solving the feature selection problem. The comparisons use the standard AOA, SMA, HHO, GA, MVO, SSA, MFO, GOA, PSO, and GWO with the metrics described in the previous sections. Table 2 presents the experimental results using the mean of the fitness function for all the compared methods over the 20 datasets. From this table, it is possible to see that the proposed AOAGA is superior in 16 of the 20 experiments; meanwhile, the PSO obtains the best results in 4 cases and the GWO in only 1. These results show that the AOAGA is superior and accurate regarding the fitness value for feature selection. In the tables, boldface indicates the best value. Continuing with the fitness value, it is possible to analyze the minimum (MIN) and maximum (MAX) values of the fitness function; this analysis permits us to know when the algorithms attain their best and worst values. Table 3 shows the MIN fitness values obtained by the selected algorithms on all the datasets. The AOAGA, PSO, and GWO have lower values in most cases (13 of the 20 datasets), which occurs because these algorithms can also produce the optimal values. On the other hand, Table 4 shows the MAX values of the fitness function obtained after the experiments of the selected algorithms over the 20 UCI datasets. The AOAGA is the method that provides the best MAX value in nearly all cases, and the PSO in only two; the rest of the algorithms do not achieve any best MAX values in the experiments. The stability of the results computed by the algorithms is analyzed using the standard deviation (STD).
The STD is calculated after all the independent experiments were performed for each dataset, using the fitness values as input; in this case, the number of experimental runs is set to 30. A lower STD represents better stability of the results, in other words, no substantial changes across the experiments. Table 5 presents the values of the STD, where the AOAGA has the lowest value in 13 out of 20 datasets, the HHO and the PSO in only 3, the AOA in 2, the MVO in 1, and the remaining methods did not achieve the best results on any dataset. On the other hand, as was previously explained, the accuracy evaluates the quality of the classification based on the true positive, true negative, false positive, and false negative values. Here, values close to 1 are expected, representing higher accuracy. Table 6 presents the values for the selected algorithms; the proposed AOAGA is the method with the best classification accuracy. The AOAGA obtains the value closest to 1 in 17 of the 20 datasets, the PSO in 3, and the MVO and the GWO in 1 each, while the rest obtain none. Moreover, Figures 3 and 4 illustrate the good performance of the proposed method over the average fitness values and the accuracy measure. In the feature selection problem, it is necessary to identify a set that contains a reduced number of the most representative features. The number of selected features can therefore also be considered a metric to verify the performance of the experiments over the selected datasets. The number of selected features for each algorithm is presented in Table 7. From this table, the AOAGA is the algorithm with the lowest value 16 times, the SMA and the GOA 2 times, and the AOA and HHO only 1 time each. The values of the rest of the algorithms are higher. The computational time for each algorithm is analyzed in Table 8; in this case, a reduced value is expected.
However, a low computational time does not by itself represent good performance, because a fast algorithm is not necessarily accurate. This can be seen in Table 8, where the SMA is the algorithm with the lowest time 13 times, followed by the MFO with 3 times and the MVO, SSA, PSO, and GWO with only 1 time each. The proposed AOAGA has the highest computational time due to the hybridization of the operators; however, its performance is better than that of the rest of the algorithms in the comparisons. Figures 5 and 6 depict the average fitness values and their boxplots, respectively. From these curves, it can be seen that the AOAGA has a better convergence rate than the other methods; this can be observed from the second half of the iterations in most datasets. In addition, it can be noticed from the boxplots that the developed AOAGA has the lowest box, and that SMA is the worst MH technique according to the results obtained in this study. Table 9 shows the mean ranks obtained using the non-parametric Friedman test. The main objective of this test is to determine whether there is a significant difference between the developed AOAGA and the other methods. From these results, it can be seen that the developed AOAGA has the smallest mean rank on thirteen datasets, representing nearly 65% of all datasets, followed by the PSO, which has the best mean rank on six datasets, nearly 30%. These results indicate the ability of the developed method to converge faster than the other methods. In this section, the developed AOAGA is also compared with other state-of-the-art FS methods, namely SMAFA [9], BSSAS3 [67], bGWO2 [68], S-bBOA [69], BGOAM [70], and Das [71].
The comparison results between the developed method and the other FS methods are given in Table 10. From this table, it can be noticed that the developed AOAGA shows good performance, obtaining the highest accuracy on sixteen datasets, whereas SMAFA has the best accuracy on seven datasets and BGOAM is the best on two datasets.

Experimental Series 2: Real Application of AOAGA
Survival data with censoring appear frequently in real applications, such as biology and epidemiology [72,73]. Nowadays, gene expression data are increasingly applied to different clinical outcomes in order to facilitate disease diagnosis. Such data are high-dimensional: the number of genes exceeds the number of observations [74]. Regression is a standard technique to jointly study the influence of multiple predictors on a response, and Cox regression is one of the most standard regression techniques for survival data with censoring. When the dimensionality of the predictors is large, the traditional method of estimating the Cox regression model is undesirable, since its prediction accuracy is low and it is hard to interpret [75]. To tackle this issue, feature selection has become an important focus in Cox regression. To examine the performance of the proposed hybrid algorithm, AOAGA, two real gene datasets were used. The first dataset is the Diffuse large B-cell lymphoma dataset (DLBC2002) [54]. DLBC2002 contains samples from 240 lymphoma patients, each with 7399 gene expression measurements. The second dataset is the Lung cancer dataset (Lung-cancer) [55]. This dataset contains information on 86 lung cancer patients, for each of whom 7129 gene expressions were measured. For both datasets, the response variable is the survival time, which may be censored.
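For intuition, the Cox (partial) log-likelihood that later serves as the fitness function can be sketched as below, assuming the Breslow form with no tie correction; `eta` stands for the linear predictor x·β restricted to the currently selected genes, and all values are illustrative:

```python
import math

def cox_partial_loglik(times, events, eta):
    """Cox partial log-likelihood (no tie correction): for every observed
    event i, add eta_i - log(sum of exp(eta_j) over subjects still at
    risk, i.e., with t_j >= t_i)."""
    ll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i:  # censored observations contribute only through risk sets
            risk = sum(math.exp(e) for t, e in zip(times, eta) if t >= t_i)
            ll += eta[i] - math.log(risk)
    return ll

# Three subjects; the third observation is censored (event = 0).
ll = cox_partial_loglik(times=[2.0, 5.0, 7.0], events=[1, 1, 0],
                        eta=[0.5, -0.2, 0.1])
```

A feature selection wrapper would maximize this quantity (or penalize it by the number of selected genes) when scoring a candidate gene subset.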

Results and Discussion of Real Gene Datasets
In order to show the performance achieved by our AOAGA and the other algorithms on the two datasets, the average, MIN, and MAX values of the log-likelihood used as a fitness function are given in Tables 11-13, respectively. From Table 11, the proposed AOAGA achieved better performance on both datasets. Moreover, it is clear from the results that the AOAGA is more successful than the AOA on all datasets. This enhancement is mainly because of the developed algorithm's ability to address the limitations of the standard AOA algorithm. In terms of the standard deviation criterion in Table 14, the AOAGA attained the lowest standard deviation value on both the DLBC2002 and Lung-cancer datasets and is considered the most stable among the compared algorithms. Table 15 summarizes the average results of the different algorithms applied. According to Table 15, the number of genes selected by MFO is larger than that of all the other algorithms, while, among the compared algorithms, the proposed AOAGA selected the fewest genes. In the Lung-cancer dataset, for example, the AOAGA selected the smallest ratio of genes; in the DLBC2002 dataset, the AOAGA showed results comparable to the SMA algorithm.

Conclusions
This paper proposed a novel feature selection method based on an improved Arithmetic Optimization Algorithm (AOA) to generate a new subset of best features. The main idea of the proposed method, called AOAGA, is to apply the operators of the genetic algorithm (GA) to boost the performance of the traditional AOA, which suffers from weaknesses in its local search strategy and in the trade-off between the search strategies. Thus, the proposed method uses a new transition mechanism to transfer between the AOA and the GA operators, which guarantees that the diversity of the solutions is kept. We evaluated the AOAGA with twenty well-known benchmark datasets to verify its effectiveness in solving different feature selection problems. Several standard evaluation criteria were used to evaluate the results of the AOAGA, including accuracy, the number of selected features, and the fitness function. Moreover, to further assess the performance of the proposed AOAGA method, two real-world problems containing gene datasets were used. Finally, the results were compared with several well-known state-of-the-art techniques to prove the performance of the proposed AOAGA method. The results illustrated that the proposed AOAGA method finds new best solutions for different test cases, and it achieved promising results compared to other comparative methods published in the literature. However, there is a certain limitation that must be addressed, namely the computational time of the proposed method in the case of high-dimensional datasets.
For future work, the proposed AOAGA can be investigated further in order to adapt its operators accurately and obtain further improvements. It can also be modified differently to adjust its search operators. Moreover, the proposed AOAGA can be tested on other benchmark optimization and real-world problems such as clustering, image segmentation, task scheduling in fog computing, medical data classification, sentiment analysis, parameter estimation, and others.