Hybrid Filter and Genetic Algorithm-Based Feature Selection for Improving Cancer Classification in High-Dimensional Microarray Data

The advancements in intelligent systems have contributed tremendously to the fields of bioinformatics, health, and medicine. Intelligent classification and prediction techniques have been used to study microarray datasets, which store gene expression information, greatly assisting in diagnosing chronic diseases such as cancer at an early stage, which is both important and challenging. However, the high dimensionality and noisy nature of microarray data lead to slow performance and low cancer classification accuracy when machine learning techniques are used. In this paper, a hybrid filter-genetic feature selection approach is proposed to address the high dimensionality of microarray datasets and ultimately enhance cancer classification precision. First, filter feature selection methods, namely information gain, information gain ratio, and Chi-squared, are applied to select the most significant features of cancerous microarray datasets. Then, a genetic algorithm is employed to further optimize the selected features in order to improve the proposed method's capability for cancer classification. To test the proficiency of the proposed scheme, four cancerous microarray datasets were used in the study: breast, lung, central nervous system, and brain cancer datasets. The experimental results show that the proposed hybrid filter-genetic feature selection approach improved the performance of several common machine learning methods in terms of Accuracy, Recall, Precision, and F-measure.


Introduction
In the last two decades, research studies in health informatics have investigated several issues related to bioinformatics, cheminformatics, cancer prediction, and others. For instance, according to the World Health Organization (WHO), heart disease causes about 12 million deaths worldwide every year.
Several methods have utilized common machine learning techniques for gene selection and cancer informatics [1,2], prediction of new bioactive molecules [3], and heart disease prediction [4][5][6][7]. Although many research studies have been conducted on cancer informatics, cancer still threatens human lives and its rate increases over time, since predicting this dangerous disease at an early stage remains a major issue in health informatics.
In the past few years, the development of many methods based on microarray datasets for analyzing gene expression has provided new ways to conduct active research in bioinformatics, cancer prediction, and similar fields [8]. These datasets contain information about human genes and the ways they are expressed. Based on the analysis of this information, biologists can conduct several studies efficiently, consuming less time and lower cost to run their experiments [9].
Recently, many machine learning methods have been applied in the analysis of microarray datasets used for cancer classification [10]. The expressions of the genes in microarray datasets can be utilized as a good basis for cancer diagnosis. However, the number of known genes keeps growing, exceeding hundreds of thousands, while the sizes of the available datasets remain small, containing few samples. This leads to the curse of dimensionality, which is one of the issues in the analysis of microarray datasets used for cancer classification [10]. In addition, another issue relates to the nature of the existing datasets, which include many redundant and irrelevant features that negatively affect the computational cost [11]. The duplicated and irrelevant features do not help to provide good classification and prediction in high-dimensional data [12]. These features reduce the performance of the prediction model and make the search for valuable knowledge more difficult. Therefore, feature selection methods need to be applied to improve the classifier's accuracy [13].
In order to improve the performance of these popular machine learning techniques, several feature selection methods have been utilized to select the most significant features of cancerous microarray datasets [14][15][16][17][18][19][20]. Although filter feature selection methods are computationally fast and can reduce the high dimensionality of microarray datasets, their accuracy is limited since the features are evaluated independently of any classifier. In contrast, wrapper feature selection methods interact with the classifier during feature evaluation, so they achieve better results than filter methods. However, wrapper methods are time-consuming when applied to high-dimensional microarray datasets.
In the last few years, evolutionary algorithms have been successfully employed for feature selection in many fields [21][22][23][24]. Although evolutionary algorithm-based feature selection methods can outperform filter and wrapper methods, they may require a long time for some machine learning algorithms.
Since cancerous microarray datasets are high-dimensional datasets with a vast number of features, it is impractical to use evolutionary algorithms from the start as feature selection methods. This encouraged us to propose a hybrid filter-genetic feature selection approach that inherits the advantages of both methods and can produce promising solutions with higher cancer classification performance in high-dimensional microarray datasets.
In this paper, combinations of filter methods and genetic algorithm-based feature selection are applied to identify an optimal subset of features for enhancing the cancer classification performance of machine learning methods on high-dimensional microarray datasets. In this study, information gain (IG), information gain ratio (IGR), and Chi-squared (CS) are applied as three common filter methods to compute a score for each feature of the microarray cancer datasets. Accordingly, only the top-ranked features are selected, while the other redundant and irrelevant features are eliminated to reduce the high-dimensional microarray datasets. Then, the reduced cancer datasets with only the top-ranked features selected by the filter methods are further optimized by the genetic algorithm (GA) to achieve better cancer classification results. We can summarize the main contributions of the paper as follows:

• Compared to previous works, we used IG, IGR, and CS as three popular, simple, and fast filter techniques to choose highly relevant features in order to reduce four high-dimensional datasets: the Brain, Breast, Lung, and CNS datasets. Although many microarray datasets are used in the literature, recent work [25] reported that popular machine learning techniques achieved their lowest classification accuracy on these specific four datasets. Furthermore, the performance improvements produced by several existing works on these four cancer datasets were limited.

• Since IG, IGR, and CS evaluate each feature individually by measuring its relationship with the class label, GA is then utilized to find the relationship between a set of features taken together and the class label, further optimizing the features selected by the filter methods to enhance the cancer classification performance.

• The experimental results show the substantial enhancements achieved by the proposed hybrid filter-genetic feature selection approach.
The remainder of the article is structured as follows. Section 2 presents the related studies on feature selection for gene selection and machine learning methods used for cancer prediction. Filter feature selection and the genetic algorithm are explained in Sections 3 and 4, respectively. Section 5 presents the research methodology of the proposed hybrid filter-GA feature selection method. Section 6 presents the experiments and evaluation, and then discusses the performance results of the proposed method. Section 7 concludes the main findings of this paper.

Related Work
Generally, cancer is considered one of the leading causes of death. To save patients' lives, it is important to identify and predict the cancer type early using advanced technological solutions, such as artificial intelligence and machine learning. Several medical datasets were used in these diagnoses, including microarray gene expression data. According to the work in [26], microarray datasets suffer from two issues, high dimensionality and small sample size, which make cancer classification a nontrivial task. The authors in [27] discussed the issue of high dimensionality for the gene expression (microarray) dataset and reported that selecting the most important genes is still a challenging task in this research field.
Several feature selection and machine learning methods have been used on genetic datasets. For instance, the work in [28] selected the genes that act as regulators and mediate the activity of transcription factors found in all promoters of the expressed gene sets. The selected gene set was fed to Dynamic Bayesian Networks (DBNs) to classify tumor samples from normal samples. The authors in [29] proposed a feature selection method using the discrete wavelet transform (DWT) and a modified genetic algorithm to identify the most important and relevant features for microarray cancer classification. The findings of this study showed, in most cases, superior results compared to existing classification techniques. Similarly, the work in [30] used five microarray cancer datasets for cancer classification and proposed feature selection methods based on wrapper and Markov blanket models. The experimental results offered high accuracy rates compared to the traditional classification methods applied on cancer microarray datasets.
Genetic algorithm (GA) is actively used as a feature selection method in different applications. For gene selection, GA was used with a t-test in [31] as an ensemble feature selection method. In this study, the t-test was used to pre-process the data, and then Nested-GA was applied to obtain the optimal set of genes on colon cancer and lung datasets. Ghosh et al. [32] introduced a two-stage feature selection method for microarray datasets. In the first stage, the union and intersection of the top-n features of symmetrical uncertainty, chi-square, and ReliefF were used as ensemble filter methods. The results of this stage were fed to the GA to obtain the optimum set of features. The proposed method was applied on five cancer datasets, and the findings showed superior performance compared to the existing methods. Recently, Abasabadi et al. [33] introduced a hybrid feature selection method by combining the SLI-γ filter feature selection method and a genetic algorithm (GA). The proposed model showed robust prediction and less execution time, especially when 1% of the best-ranked features were used for generating the GA population. Similarly, the authors in [34] highlighted the importance of proposing feature selection for microarray datasets because of the risk of over-fitting due to the small size of the data samples. Therefore, they introduced the Multi-Fitness RankAggreg Genetic Algorithm (MFRAG), which combines nine feature selection methods for evaluating the feature weights and individuals and uses ensemble models to compute the individual fitness. The experiments were conducted on several microarray datasets, and the findings showed that the proposed method obtained superior accuracy compared to the existing methods. In addition, the authors in [27] developed a feature selection method for several cancerous microarray datasets based on monarch butterfly optimization wrapped with the Broad Learning System (BLS).
Other previous studies applied feature selection and machine learning methods on the same datasets used in this work, namely the Brain, Breast, Lung, and CNS datasets. For instance, Hameed et al. [35] applied the combination of Pearson's Correlation Coefficient (PCC) with Genetic Algorithm (GA) or Binary Particle Swarm Optimization (BPSO) on these microarray datasets. They obtained a good performance when SVM was applied with the PCC and GA feature selection combination (up to 98.33% accuracy for the CNS dataset). However, these methods obtained lower performance on the other datasets (up to 88.66% accuracy for the same model on the Breast dataset). In addition, the authors in [36] applied a fusion-based feature selection method on the Brain, Breast, and CNS microarray datasets. The highest accuracy (95%) was achieved when SVM was applied on the Brain dataset. However, the model achieved lower performance on the other datasets. Similarly, a hybrid feature selection method was applied to these four microarray datasets in [37]. The method combined the Gini index and support vector machine with Recursive Feature Elimination (GI-SVM-RFE). However, the highest accuracy achieved by this model was 90.67% for the Breast dataset. In addition, Almugren and Alshamlan [38] conducted a survey on the existing hybrid filter and wrapper feature selection methods with machine learning that were applied on microarray datasets. It can be observed that most of the conducted studies [39][40][41][42][43][44][45] worked on datasets with lower dimensionality (compared to the Brain and Breast datasets applied in this study) such as the Colon, leukemia 1, leukemia 2, Prostate, and SRBCT (small round blue cell tumors) datasets.
Although evolutionary algorithms have been utilized in the feature selection process on microarray datasets for cancer classification, using evolutionary algorithms as filter or wrapper feature selection methods on microarray datasets is still being investigated in recent studies. Furthermore, there is still a need for more research to investigate different hybridizations and combinations of filter methods with evolutionary algorithms on different microarray datasets.

Filter Feature Selection
Many microarray datasets suffer from high-dimensional, noisy data, which can cause inaccurate prediction, low classification accuracy, and slow performance of machine learning techniques [26]. Feature selection is one of the most crucial pre-processing steps used to identify the most influential features in order to increase the performance of machine learning. Due to limited resources, it is impracticable or complicated to use all features of high-dimensional microarray datasets with machine learning algorithms. Thus, it is crucial to utilize a feature selection method in cancer classification problems on high-dimensional microarray datasets to remove noisy data and eliminate redundant and irrelevant features [27].
Feature selection methods are broadly classified into filter and wrapper approaches based on the process of feature evaluation. In the filter approaches, the features are evaluated based on certain criteria independently of a classifier. The wrapper approaches, by contrast, employ a classifier to evaluate the features and then select the best ones. The wrapper methods are computationally intensive since they train a machine learning algorithm several times with many potential subsets of features. In contrast, the filter approaches are simpler and faster, as they are accomplished before the training of a machine learning algorithm [22,46].
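The filter/wrapper contrast can be sketched with a toy example. Everything below is an illustrative stand-in, not the paper's actual methods: a filter scores each feature against the labels with no classifier in the loop, while a wrapper evaluates whole feature subsets by running a (here trivial, majority-vote) classifier.

```python
import random

# Toy dataset: 6 binary features; feature 0 carries the class label exactly.
random.seed(0)
X = [[random.randint(0, 1) for _ in range(6)] for _ in range(40)]
y = [row[0] for row in X]  # class equals feature 0 by construction

def filter_score(feature_idx):
    """Filter style: score one feature against the labels, no classifier
    involved (agreement rate stands in for IG / Chi-squared)."""
    agree = sum(1 for row, label in zip(X, y) if row[feature_idx] == label)
    return agree / len(X)

def wrapper_score(subset):
    """Wrapper style: evaluate a whole feature subset by running a
    (trivial majority-vote) classifier, so features interact with the model."""
    if not subset:
        return 0.0
    preds = [round(sum(row[i] for i in subset) / len(subset)) for row in X]
    return sum(p == label for p, label in zip(preds, y)) / len(y)

filter_ranking = sorted(range(6), key=filter_score, reverse=True)
print(filter_ranking[0])   # the perfectly informative feature ranks first
print(wrapper_score([0]))  # wrapper accuracy of the best single-feature subset
```

Note the asymmetry: `filter_score` never trains anything, while `wrapper_score` must re-evaluate a model for every candidate subset, which is why wrappers become expensive on high-dimensional data.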

Genetic Algorithm
The genetic algorithm [47] is one of the most effective evolutionary algorithms, inspired by the biological evolution of chromosomes. The genetic algorithm (GA) has been successfully utilized for solving several search and optimization problems in many real-world applications. In recent years, GA has been used effectively to identify the optimal feature set in many different fields [21][22][23][24][48].
GA starts by initializing a population consisting of a set of arbitrarily created chromosomes. Each chromosome in the population represents a potential solution and includes several genes. Then, GA reproduces new, better chromosomes (solutions) by evaluating the current chromosomes and then recombining the fittest ones.
At each GA generation, pairs of fit chromosomes are chosen, depending on the fitness function, to be parents for mating. In GA, the tournament and roulette wheel methods are the two most popular selection methods used in the literature. The genetic crossover and mutation operators are then applied to create new offspring chromosomes for the next generation. In the GA crossover, a crossover point in the parent chromosomes is arbitrarily chosen, and the genes after that point are exchanged to produce new children. In the GA mutation, GA randomly alters gene values in the offspring chromosome.
Over consecutive generations, the population iteratively evolves toward an optimal solution using selection, crossover, and mutation until the termination criterion is satisfied.
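The evolutionary loop described above (tournament selection, single-point crossover, bit-flip mutation) can be sketched as follows. The bit-string target, population size, and rates are illustrative choices, not the paper's settings:

```python
import random

random.seed(1)
TARGET = [1, 0, 1, 1, 0, 1, 0, 1]  # toy optimum the GA should rediscover

def fitness(chrom):
    # number of genes matching the target (maximum = len(TARGET))
    return sum(g == t for g, t in zip(chrom, TARGET))

def tournament(pop, k=3):
    # tournament selection: fittest of k randomly sampled chromosomes
    return max(random.sample(pop, k), key=fitness)

def crossover(p1, p2):
    # single-point crossover: genes after a random point are exchanged
    point = random.randrange(1, len(p1))
    return p1[:point] + p2[point:]

def mutate(chrom, rate=0.05):
    # bit-flip mutation: occasionally invert a gene value
    return [1 - g if random.random() < rate else g for g in chrom]

pop = [[random.randint(0, 1) for _ in TARGET] for _ in range(20)]
best = max(pop, key=fitness)
for _ in range(50):  # generations
    pop = [mutate(crossover(tournament(pop), tournament(pop))) for _ in pop]
    best = max(pop + [best], key=fitness)  # track the best solution seen

print(fitness(best))  # typically reaches the optimum of 8
```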

Proposed Methodology
This section provides a detailed description of the proposed hybrid filter-genetic algorithm-based feature selection approach used for cancer classification in high-dimensional microarray datasets. As shown in Figure 1, the methodology includes three phases: collection of high-dimensional microarray data, the training phase, and the classification phase.


Collection of High-Dimensional Microarray Data
In this paper, we used four high-dimensional cancerous microarray datasets to assess the performance of the proposed hybrid filter-GA feature selection method. These four datasets are Lung cancer [49], Central Nervous System (CNS) [49], Breast cancer [50], and Brain cancer [51,52]. The description of the high-dimensional datasets used in this study is displayed in Table 1. The Breast cancer dataset has 97 samples (instances) with 24,481 features (genes); it consists of 46 samples from patients whose cancer created distant metastases within 5 years and 51 samples from patients who remained free of distant metastases for at least 5 years. The Lung cancer dataset has 203 samples with five classes and 12,600 features. The samples in the Lung cancer dataset are labeled with a normal lung class (17 samples) and four lung tumor classes: adenocarcinoma (139 samples), small cell lung cancer (6 samples), squamous cell carcinoma (21 samples), and pulmonary carcinoid (20 samples). The Brain tumor dataset has 42 microarray samples with 5597 features and five classes: medulloblastomas, malignant gliomas, atypical teratoid/rhabdoid tumors, primitive neuroectodermal tumors, and human cerebella. The CNS cancer dataset has 60 samples with 7129 genes and two classes: 21 samples are survivors and 39 are failures. Each dataset used in this study is then divided into two parts: the training dataset is used in the training phase, while the testing dataset is used in the classification (testing) phase.

Training Phase
The training phase consists of three main stages: feature ranking using filter algorithms, GA-based feature selection, and training of machine learning techniques.

Feature Ranking Using Filter Algorithms
Since the microarray datasets used in this paper are high-dimensional datasets with a vast number of features, it is impractical and time-consuming to use wrapper or evolutionary algorithms directly as feature selection methods. Therefore, it is essential to first reduce the dimensionality of the datasets using filter feature selection algorithms before applying evolutionary feature selection algorithms.
In this paper, information gain, gain ratio, and Chi-squared were applied as three common filter methods to compute a score for each feature. Then, only the top 5% of ranked features were kept, while the other irrelevant and redundant features were eliminated to reduce the dimensionality of the datasets.
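The top-5% cut can be sketched as follows. The random scores below stand in for the IG, IGR, or Chi-squared scores; the 5% threshold matches the one used in this paper:

```python
import random

random.seed(2)
n_features = 200
# Hypothetical relevance scores, e.g. as produced by IG, IGR or Chi-squared.
scores = {f"gene_{i}": random.random() for i in range(n_features)}

def top_percent(score_map, percent=5):
    """Keep only the top `percent`% of features by score; drop the rest."""
    k = max(1, int(len(score_map) * percent / 100))
    ranked = sorted(score_map, key=score_map.get, reverse=True)
    return ranked[:k]

selected = top_percent(scores, percent=5)
print(len(selected))  # 5% of 200 features -> 10 kept
```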

• Information gain: the information gain (IG) is one of the popular filter techniques that has been successfully applied in many applications to choose highly relevant features in order to reduce high-dimensional datasets. IG uses the entropy measure to determine the relevance of features by calculating the information gain of each feature with respect to the class labels. In IG, Equation (1) is used to evaluate the features:

IG(S, A) = Entropy(S) − Σ_{v ∈ Value(A)} (|S_v| / |S|) × Entropy(S_v)    (1)

where Value(A) represents the set of values of a feature A, while S_v denotes the subset of S for which feature A has value v. To compute Entropy(S), we need Pr(c_j), the probability of class c_j in S, as shown in Equation (2):

Entropy(S) = − Σ_j Pr(c_j) × log2 Pr(c_j)    (2)
The IG is commonly used to identify the significance degree of a feature. However, IG may suffer from an overfitting problem, since it is biased towards features with many distinct values.
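For illustration, Equations (1) and (2) can be computed directly for a discretized feature; the toy labels and feature values below are hypothetical:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_j Pr(c_j) * log2 Pr(c_j), as in Equation (2)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v), Equation (1)."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [c for f, c in zip(feature_values, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

labels = ["tumor", "tumor", "normal", "normal"]
# A perfectly predictive gene yields IG equal to the class entropy (1.0 here);
# a gene unrelated to the class yields IG of 0.
print(information_gain(["hi", "hi", "lo", "lo"], labels))  # 1.0
print(information_gain(["hi", "lo", "hi", "lo"], labels))  # 0.0
```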

• Information gain ratio
The information gain ratio (IGR) was introduced to improve information gain by taking the number and size of branches into account when choosing an attribute, reducing its bias toward attributes with many branches.
IGR uses Equation (3) to evaluate the features:

IGR(S, A) = IG(S, A) / SplitInformation(S, A)    (3)

SplitInformation(S, A) is computed using Equation (4):

SplitInformation(S, A) = − Σ_i (|S_i| / |S|) × log2(|S_i| / |S|)    (4)

where S and S_i represent the original dataset and the ith sub-dataset after splitting, while |S| and |S_i| are the numbers of samples belonging to S and S_i, respectively.
• Chi-squared: Chi-squared (CS) [53] is a simple and fast filter technique that determines the significance of a feature by testing its independence from the class label. The null hypothesis of the Chi-squared test, that the feature and the class label are independent, is tested by the χ2 statistic, computed as the sum of squared differences between observed and expected frequencies, as shown in Equation (5):

χ2 = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_ij − E_ij)^2 / E_ij    (5)

where O_ij and E_ij represent the observed and expected frequencies, respectively, c is the number of classes, and r is the number of bins used for the discretization of numerical features. The importance of each feature is evaluated by calculating χ2 with respect to the class; a feature with a higher χ2 is more important for the classification decision.
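A minimal sketch of Equation (5) on a small r × c contingency table (the counts are hypothetical):

```python
def chi_squared(observed):
    """chi2 = sum_ij (O_ij - E_ij)^2 / E_ij over an r x c contingency table
    of feature bins (rows) versus classes (columns), as in Equation (5)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected frequency
            stat += (o - e) ** 2 / e
    return stat

# Feature-bin counts per class: strongly class-dependent vs. independent.
print(chi_squared([[20, 0], [0, 20]]))    # 40.0: feature depends on class
print(chi_squared([[10, 10], [10, 10]]))  # 0.0: independent -> irrelevant
```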

GA-Based Feature Selection
Although the filter methods can reduce the high-dimensional training datasets, their performance is not sufficiently accurate since the features are evaluated based on certain criteria independently of a machine learning algorithm. Furthermore, most filter approaches evaluate features individually by measuring the relationship between each feature and the class labels while assuming all features are independent. Therefore, GA was utilized to further optimize the selected features obtained from the filter methods to enhance the cancer classification performance.
GA is a global optimization searching algorithm that is effectively applied as a feature selection technique to identify the most significant features in many applications.
The feature selection based on GA is generally conducted in the following four key stages:

1. Chromosome encoding: the GA population includes a set of chromosomes and denotes the search space, which represents all possible feature subsets. Each chromosome in the population represents a feature subset and is encoded as a binary string containing m genes, where m is the number of available features. If a feature is selected, its gene is set to one; otherwise, it is set to zero.

2. Population initialization: initially, GA arbitrarily generates an initial population of chromosomes that correspond to subsets of the candidate features.

3. Fitness evaluation: GA evaluates each individual chromosome by computing its fitness function. In GA-based feature selection, the training dataset restricted to the features selected by a chromosome is used to train the machine learning technique, and the resulting classification accuracy is used as the fitness of that chromosome. In this step, GA tries to find the subset of features that maximizes the machine learning performance.

4. Reproduction: as in biological evolution, the fittest chromosomes are selected and recombined to reproduce better new chromosomes (solutions). Three genetic operators perform the reproduction procedure:
• Selection: the chromosomes with better fitness values are chosen as parents to generate new children.
• Crossover: GA exchanges the genes of two parent chromosomes after a randomly chosen crossover point to produce new child chromosomes.
• Mutation: GA occasionally flips the value of a gene in a child chromosome from 1 to 0 or from 0 to 1.
GA iteratively evolves the chromosomes to generate new generations of better solutions by repeating the fitness evaluation and reproduction process until GA meets one of the termination criteria, such as obtaining a satisfactorily optimal fitness or reaching the maximum number of generations. Algorithm 1 shows the pseudocode of the hybrid filter-genetic algorithm-based feature selection approach; it ends by obtaining the best chromosome C_best, extracting the optimal selected features SF from C_best (the genes set to 1), and returning SF.
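The fitness evaluation in stage 3 can be sketched as follows. The nearest-centroid classifier and the synthetic data below are illustrative stand-ins for the machine learning techniques and microarray data used in the paper:

```python
import random

random.seed(3)
# Toy reduced dataset: 8 candidate genes, only genes 0 and 1 carry signal.
def sample(label):
    noise = [random.random() for _ in range(6)]
    return [label + random.gauss(0, 0.1), label + random.gauss(0, 0.1)] + noise

X = [sample(lbl) for lbl in ([0] * 20 + [1] * 20)]
y = [0] * 20 + [1] * 20

def fitness(chromosome):
    """Fitness of a binary chromosome = accuracy of a nearest-centroid
    classifier trained only on the genes whose bit is set to 1."""
    idx = [i for i, bit in enumerate(chromosome) if bit == 1]
    if not idx:
        return 0.0  # empty feature subset: worst possible fitness
    centroids = {}
    for cls in (0, 1):
        rows = [X[k] for k in range(len(X)) if y[k] == cls]
        centroids[cls] = [sum(r[i] for r in rows) / len(rows) for i in idx]
    def predict(row):
        dist = {cls: sum((row[i] - m) ** 2 for i, m in zip(idx, c))
                for cls, c in centroids.items()}
        return min(dist, key=dist.get)
    return sum(predict(r) == lbl for r, lbl in zip(X, y)) / len(y)

print(fitness([1, 1, 0, 0, 0, 0, 0, 0]))  # informative genes only: high
print(fitness([0, 0, 1, 1, 1, 1, 1, 1]))  # noise genes only: lower
```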

Training of Machine Learning Techniques
In this step, the reduced training dataset with the optimal features selected by GA was utilized to train some common machine learning algorithms for classifying cancer in high-dimensional microarray datasets. In this study, the support vector machine (SVM), naïve Bayes classifier (NB), k-nearest neighbor (kNN), decision tree (DT), and random forest (RF) were chosen since they are commonly used in the literature to classify cancer in high-dimensional microarray datasets. The trained classification models were then kept to be employed in the classification phase with the new testing datasets.
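A hedged sketch of this training step using scikit-learn (assuming it is available): the synthetic data stands in for the reduced, GA-selected training dataset, and the default model parameters are illustrative choices, not the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the training split restricted to the GA-selected features.
X_train, y_train = make_classification(n_samples=80, n_features=10,
                                       random_state=42)

models = {
    "SVM": SVC(),
    "NB": GaussianNB(),
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(random_state=42),
}
# Fit each classifier and keep the trained models for the classification phase.
trained = {name: m.fit(X_train, y_train) for name, m in models.items()}
for name, m in trained.items():
    print(name, round(m.score(X_train, y_train), 2))
```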

Classification Phase
In this phase, the classification models trained in the training phase were evaluated on a new dataset, called the testing dataset. The initial testing dataset was reduced by selecting only the same top features ranked by the filter algorithms in the training phase. Then, the optimal features selected by GA in the training phase were employed to select the final features of the testing dataset. Accordingly, the trained classification models were employed to classify cancer in the final testing dataset with the optimal feature subset, and their performances were evaluated using popular classification measures: Classification Accuracy, Recall, Precision, and F-measure.
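The two-stage reduction of a test sample can be sketched as follows (the feature indices, GA chromosome, and sample values are all illustrative):

```python
# Stage 1: columns kept by the filter ranking during training (illustrative).
filter_top_idx = [0, 3, 4, 7, 9]
# Stage 2: GA chromosome over those columns (1 = keep, 0 = drop; illustrative).
ga_selected = [1, 0, 1, 1, 0]

# A hypothetical unseen test sample with all 10 original features.
test_sample = [0.2, 1.5, 0.7, 0.9, 0.1, 2.2, 0.4, 0.8, 1.1, 0.3]

reduced = [test_sample[i] for i in filter_top_idx]          # filter reduction
final = [v for v, bit in zip(reduced, ga_selected) if bit]  # GA reduction
print(final)  # features actually fed to the trained classifier
```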

Experimental Settings
We conducted many experiments to identify the best GA parameters. In this study, the parameters used in the proposed hybrid filter-GA feature selection method were selected on a trial-and-error basis in order to produce the best results. Table 2 shows the settings of the GA parameters used with the proposed hybrid filter-GA feature selection method on all experimental datasets.

Performance Metrics
In this study, 10-fold cross-validation was used to assess the proposed hybrid filter-GA feature selection method for enhancing the performance of popular machine learning techniques. The number of selected features and the performance measures on the testing dataset were computed for each run of the 10-fold cross-validation. Then, the overall number of selected attributes and the overall performance measures were averaged over all runs.
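The fold construction and averaging can be sketched as follows. The per-fold scores below are placeholders for illustration; in the actual pipeline, each fold would run feature selection, training, and testing on its own split:

```python
import random

random.seed(4)
samples = list(range(50))  # indices of the dataset samples
random.shuffle(samples)

k = 10
folds = [samples[i::k] for i in range(k)]  # 10 disjoint folds of 5 samples

accuracies = []
for i, test_fold in enumerate(folds):
    train_idx = [s for j, f in enumerate(folds) if j != i for s in f]
    # ... feature selection + training would run on train_idx only,
    #     then the model would be scored on test_fold ...
    acc = 0.9 + 0.01 * i  # placeholder per-fold score for illustration
    accuracies.append(acc)

print(round(sum(accuracies) / k, 3))  # overall score = mean over the 10 runs
```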
In addition to the number of selected features, the Classification Accuracy, Recall, Precision, and F-Measure were used to measure the performance of the proposed hybrid filter-GA feature selection method. They are briefly explained as follows. The Classification Accuracy is the percentage of instances correctly classified, as shown in Equation (6):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (6)

where TP is the number of positive instances correctly classified as positive, TN is the number of negative instances correctly classified as negative, FP is the number of negative instances incorrectly classified as positive, and FN is the number of positive instances incorrectly classified as negative.
The Recall is the percentage of positive instances correctly classified as belonging to the positive class, as shown in Equation (7):

Recall = TP / (TP + FN)    (7)
The Precision is the number of correctly classified positive instances divided by the total number of instances classified as positive, as shown in Equation (8):

Precision = TP / (TP + FP)    (8)
The F-Measure is the harmonic mean combining both precision and recall, as shown in Equation (9):

F-Measure = (2 × Precision × Recall) / (Precision + Recall)    (9)

Results and Discussion
In this section, the performances of the machine learning techniques after applying the proposed hybrid filter-GA feature selection method are compared to the stand-alone machine learning techniques and to the machine learning techniques using only the filter feature selection methods.
Figures 2-5 and Tables 3-6 show the comparison of the classification results of the machine learning techniques using all features, the features selected by the filter methods only, and the features selected by the proposed hybrid filter-GA methods on the four high-dimensional datasets: Brain, Breast Cancer, Lung, and CNS. For the filter methods, the best classification results were achieved by training the machine learning techniques with the top 5% of ranked features on all four datasets.
For the Brain dataset, Figure 2 and Table 3 show that the classification accuracies of SVM (69.05%), NB (69.05%), kNN (78.57%), DT (50%), and RF (78.57%) were enhanced by applying IG to 73.81%, 88.1%, 80.95%, 61.9%, and 90.48%, by applying IGR to 78.57%, 85.71%, 83.33%, 64.29%, and 92.86%, and by applying CS to 83.33%, 83.33%, 80.95%, 69.05%, and 88.1%, respectively. Furthermore, the proposed hybrid IG-GA feature selection method further increased the classification accuracies of SVM, NB, kNN, DT, and RF to 85.71%, 92.86%, 92.86%, 85.71%, and 100%, while the proposed hybrid IGR-GA feature selection method enhanced them to 97.62%, 95.24%, 97.62%, 88.1%, and 100%, respectively. In addition, Figure 2 and Table 3 show that the proposed hybrid CS-GA feature selection method further increased the classification accuracies of SVM, NB, kNN, DT, and RF to 97.62%, 95.24%, 97.62%, 85.71%, and 100%, respectively. In terms of Recall and Precision, the results in Table 3 demonstrate that SVM, NB, kNN, DT, and RF with the proposed hybrid filter-GA feature selection methods achieved better performance than the stand-alone classifiers or the classifiers with only the filter algorithms. Consequently, SVM, NB, kNN, DT, and RF with the proposed hybrid filter-GA feature selection methods produced the best F-measure among the compared approaches, since the F-measure combines both precision and recall. It can also be noticed from Figure 2 and Table 3 that RF with the proposed hybrid IG-GA, IGR-GA, and CS-GA methods accomplished the best performance among the classifiers with the other feature selection methods.

For the Breast Cancer dataset, Figure 3 and Table 4 show that IG contributed to improving the classification accuracies of SVM (52.58%), NB (48.45%), kNN (55.67%), DT (57.73%), and RF (63.92%) to 74.23%, 55.67%, 71.13%, 67.01%, and 86.6%, while they were improved by applying IGR to 69.07%, 54.64%, 64.95%, 60.82%, and 87.63%, respectively. Furthermore, SVM, NB, kNN, DT, and RF were enhanced by applying CS to 73.2%, 72.16%, 72.16%, 69.07%, and 81.44%, respectively. Figure 3 and Table 4 also show that SVM, NB, kNN, DT, and RF were further enhanced by the proposed hybrid IG-GA, IGR-GA, and CS-GA methods compared to using only the filter algorithms. The classification accuracies of SVM, NB, kNN, DT, and RF were further enhanced by the proposed hybrid IG-GA method to 84.54%, 57.73%, 89.69%, 86.6%, and 89.69%, by the proposed hybrid IGR-GA method to 82.47%, 62.89%, 86.6%, 90.72%, and 93.81%, and by the proposed hybrid CS-GA method to 82.47%, 79.38%, 84.54%, 84.54%, and 85.57%, respectively. In addition to the classification accuracy, Table 4 shows the Recall, Precision, and F-measure of SVM, NB, kNN, DT, and RF before and after applying the proposed hybrid filter-GA feature selection methods. As can be observed from Table 4, the Recall, Precision, and F-measure of SVM, NB, kNN, and DT were remarkably enhanced by the proposed hybrid filter-GA methods compared to the stand-alone classifiers or the classifiers with only the filter algorithms. From Figure 3 and Table 4, we can also observe that RF and DT with the proposed hybrid IGR-GA method achieved the best performance among the classifiers that applied other feature selection methods.

For the Lung Cancer dataset, Figure 4 and Table 5 demonstrate that the classification accuracies of SVM (78.82%), NB (90.15%), and RF (83.74%) were enhanced by applying IG to 92.12%, 95.07%, and 93.6%, by applying IGR to 83.25%, 93.6%, and 91.13%, and by applying CS to 84.24%, 92.12%, and 92.61%, respectively. Figure 4 and Table 5 also show that the performances of kNN and DT after applying IG, IGR, and CS were almost the same or slightly higher than the performances of the stand-alone kNN
and DT.Compared to using only filter algorithms, the proposed hybrid IG-GA, IGR-GA, and CS-GA methods achieved substantially better classification results.The proposed hybrid IG-GA method increased further the classification accuracies of SVM, NB, kNN, DT, and RF to 94.09%, 98.52%, 97.04%, 96.55%, and 96.06%, while they were enhanced by applying the proposed hybrid IGR-GA method to 94.58%, 97.54%, 96.06%, 96.06%, and 95.57%, respectively.Furthermore, they were enhanced by applying the proposed hybrid CS-GA method to 95.07%, 97.04%, 95.57%, 96.55%, and 96.06%, respectively.In terms of Recall and Precision, and F-measure, Table 5 shows that SVM, NB, kNN, DT, and RF with applying the proposed hybrid filter-GA methods performed significantly better Recall and Precision, and F-measure compared to the stand-alone SVM, NB, kNN, DT and RF, and their performances with considering only For Breast Cancer dataset, Figure 3 and Table 4 show that IG contributed to improving the classification accuracies of SVM (52.58%),NB (48.45%), kNN (55.67%),DT (57.73%) and RF (63.92%) to 74.23%, 55.67%, 71.13%, 67.01%, and 86.6%, while they were improved by applying IGR to 69.07%, 54.64%, 64.95%, 60.82%, and 87.63%, respectively.Furthermore, SVM, NB, kNN, DT, and RF were enhanced by applying CS to 73.2%, 72.16%, 72.16%, 69.07%, and 81.44%, respectively.Figure 3 and Table 4 also show that the SVM, NB, kNN, DT, and RF were enhanced further by the proposed hybrid IG-GA, IGR-GA, and CS-GA methods compared to using only filter algorithms.The classification accuracies of SVM, NB, kNN, DT, and RF were enhanced further by the proposed hybrid IG-GA method to 84.54%, 57.73%, 89.69%, 86.6%, and 89.69%, while improved by the proposed hybrid IGR-GA method to 82.47%, 62.89%, 86.6%, 90.72%, and 93.81%, and enhanced by the proposed hybrid CS-GA method to 82.47%, 79.38%, 84.54%, 84.54%, and 85.57%, respectively.In addition to the classification accuracy, Table 4 shows the performance in terms of Recall and 
Precision, and F-measure of SVM, NB, kNN, DT, and RF before and after applying the proposed hybrid filter-GA feature selection methods.As can be observed from results in Table 4, Recall and Precision, and F-measure of SVM, NB, kNN, and DT were remarkably enhanced by applying the proposed hybrid filter-GA, compared to performances of the stand-alone classifiers or their performances with considering only filter algorithms.From Figure 3 and Table 4, we can observe also that RF and DT after employing the proposed hybrid IGR-GA method achieved the best performance among the classifiers that applied other feature selection methods.
For Lung Cancer dataset, Figure 4 and Table 5 demonstrate that the classification accuracies of SVM (78.82%),NB (90.15%), and RF (83.74%) were enhanced by applying IG to 92.12%, 95.07%, and 93.6%, while they were enhanced by applying IGR to 83.25%, 93.6%, and 91.13%, respectively.In addition, they are enhanced by applying CS to 84.24%, 92.12%, and 92.61%, respectively.Figure 4 and Table 5 also show that the performances of kNN and DT after applying IG, IGR, and CS were almost the same or slightly higher than the performances of the stand-alone kNN and DT.Compared to using only filter algorithms, the proposed hybrid IG-GA, IGR-GA, and CS-GA methods achieved substantially better classification results.The proposed hybrid IG-GA method increased further the classification accuracies of SVM, NB, kNN, DT, and RF to 94.09%, 98.52%, 97.04%, 96.55%, and 96.06%, while they were enhanced by applying the proposed hybrid IGR-GA method to 94.58%, 97.54%, 96.06%, 96.06%, and 95.57%, respectively.Furthermore, they were enhanced by applying the proposed hybrid CS-GA method to 95.07%, 97.04%, 95.57%, 96.55%, and 96.06%, respectively.In terms of Recall and Precision, and F-measure, Table 5 shows that SVM, NB, kNN, DT, and RF with applying the proposed hybrid filter-GA methods performed significantly better Recall and Precision, and F-measure compared to the stand-alone SVM, NB, kNN, DT and RF, and their performances with considering only filter algorithms.As can be seen also in Figure 4 and Table 5, NB classifier after applying the proposed hybrid IG-GA, IGR-GA and CS-GA, and kNN classifier after employing the proposed hybrid IG-GA method achieved the best performance among the classifiers that applied other feature selection methods.
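For reference, the four reported metrics can be computed with scikit-learn as in the following sketch. The labels and predictions here are made-up placeholders, not results from the paper.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and classifier predictions (placeholder values only).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# Macro averaging weights both classes equally, which suits the
# class imbalance common in microarray datasets.
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F-measure:", f1_score(y_true, y_pred, average="macro"))
```

With these placeholder labels the accuracy is 0.8, while the macro-averaged precision, recall, and F-measure are each about 0.792, illustrating how the F-measure balances precision and recall.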
Processes 2023, 11, x FOR PEER REVIEW
For the CNS dataset, Figure 5 and Table 6 demonstrate that IG improved the classification accuracies of NB (61.67%), kNN (61.67%), DT (58.33%), and RF (53.33%) to 75%, 75%, 70%, and 80%, while IGR improved them to 78.33%, 68.33%, 68.33%, and 80%, respectively. Furthermore, CS raised them to 70%, 75%, 61.67%, and 83.33%, respectively. It can be observed that the performance of SVM was not enhanced by applying IG, IGR, or CS. Table 6 and Figure 5 also show that further improvements were achieved with the proposed hybrid IG-GA, IGR-GA, and CS-GA methods. The classification accuracies of SVM, NB, kNN, DT, and RF were raised by the proposed hybrid IG-GA method to 86.67%, 90%, 93.33%, 93.33%, and 91.67%; by the proposed hybrid IGR-GA method to 65%, 88.33%, 83.33%, 93.33%, and 90%; and by the proposed hybrid CS-GA method to 83.33%, 83.33%, 88.33%, 88.33%, and 88.33%, respectively. In addition to enhancing the classification accuracy, the results in Table 6 demonstrate that the Recall, Precision, and F-measure of SVM, NB, kNN, and DT were outstandingly enhanced by the proposed hybrid filter-GA methods, compared to the stand-alone classifiers and to the classifiers with the filter algorithms alone. As can also be noticed in Figure 5 and Table 6, the kNN and DT classifiers after applying the proposed IG-GA method, and the DT classifier after applying the proposed IGR-GA method, accomplished the best performance among the classifiers that applied the other feature selection methods.
Table 3. The performance comparison of classifiers using all features, the features selected by the filter methods, and the features selected by the proposed hybrid filter-GA methods on the Brain dataset.
As mentioned in Section 6.1, the four datasets used in this study were high-dimensional datasets with very large numbers of features. It was therefore a critical step to remove the irrelevant and redundant features and select only the optimal feature subsets that could improve the performance of the machine learning techniques. In our experiments, the top 5% of the features ranked by IG, IGR, and CS were selected in order to remove the redundant and irrelevant features and reduce the high-dimensional datasets. Then, the GA was used to find the most significant and relevant features and eliminate the less relevant ones, in order to maximize the performance of the machine learning techniques.
Table 7 shows the number of features selected by the proposed hybrid filter-GA feature selection methods compared to using only the filter IG, IGR, and CS methods on the four datasets. As can be seen, IG, IGR, and CS reduced the number of features of the Brain, Breast Cancer, Lung, and CNS datasets from 5597, 24,481, 12,600, and 7129 features to 280, 1224, 630, and 356, respectively. Furthermore, for the Brain dataset, only 135 significant features on average were selected by both the proposed hybrid IG-GA and IGR-GA methods, while 139 important features on average were selected by the proposed hybrid CS-GA method. For the Breast Cancer dataset, only 617, 616, and 623 significant features on average were selected by the proposed hybrid IG-GA, IGR-GA, and CS-GA methods, respectively. For the Lung dataset, only 326, 318, and 320 influential features on average were selected by the proposed hybrid IG-GA, IGR-GA, and CS-GA methods, respectively. For the CNS dataset, only 183, 179, and 172 relevant features on average were selected by the proposed hybrid IG-GA, IGR-GA, and CS-GA methods, respectively.
In this section, the proposed hybrid filter-genetic feature selection methods are compared with several previous studies that applied filter-based and hybrid-based feature selection to reduce the dimensionality of the four microarray datasets used here: the Brain, Breast, Lung, and CNS datasets. Hameed et al. [35] applied PCC-GA and PCC-BPSO to the microarray datasets, combining Pearson's Correlation Coefficient (PCC) with a Genetic Algorithm (GA) or Binary Particle Swarm Optimization (BPSO). A fusion-based feature selection method was applied by [36] to the microarray datasets to improve the effectiveness of cancer classification. Recently, the authors in [37] applied a hybrid feature selection method to these four microarray datasets that combined the Gini index and a support vector machine with Recursive Feature Elimination (GI-SVM-RFE).
The performances of the proposed hybrid filter-genetic feature selection methods were compared against these related studies that used the same datasets. The accuracy comparison reported in Table 8 shows the superior performance of the proposed methods: the experimental results demonstrate that the proposed IG-GA, IGR-GA, and CS-GA methods outperformed the competitor methods for the five machine learning algorithms on most of the datasets used in this study.
Figures 2-5 and Tables 3-6 show that almost all the stand-alone machine learning techniques trained with all features did not obtain good classification results on the four datasets, as these datasets suffer from the curse of dimensionality. This was expected, since the classifiers were trained with redundant and irrelevant features. Furthermore, Figures 2-5 and Tables 3-6 also show that the IG, IGR, and CS filter methods improved the performance of SVM, NB, kNN, DT, and RF on most of the datasets used in this study. However, the performances of SVM, NB, kNN, DT, and RF with only IG, IGR, and CS were not good enough, since filter methods usually evaluate features independently of a classifier and ignore the relationships between features. In the proposed hybrid filter-GA methods, on the other hand, the training of the classifier and the relationships between features are taken into consideration during the feature selection process. Therefore, SVM, NB, kNN, DT, and RF with the proposed hybrid filter-GA feature selection methods achieved considerably better performance than the stand-alone classifiers or the classifiers with the filter algorithms alone.
The results in Table 7 show that IG, IGR, and CS accomplished better classification results even though only the top 5% of features were used. IG, IGR, and CS reduced the number of features of the Brain, Breast Cancer, Lung, and CNS datasets to 280, 1224, 630, and 356 features, respectively.
The results in Table 7 also demonstrate that the GA removed about 50% of the irrelevant features from the datasets that had been reduced by the filter methods in the first step, so that only a smaller set of important attributes was used for training the machine learning techniques. Thus, after applying the proposed hybrid filter-GA feature selection methods, the classifiers accomplished better classification results with far fewer features, because the proposed hybrid IG-GA, IGR-GA, and CS-GA methods were capable of excluding redundant, irrelevant, and less relevant features.
From Tables 3-6, we can also observe that the machine learning techniques that applied the proposed hybrid IG-GA, IGR-GA, and CS-GA methods achieved higher classification results on the cancer datasets with smaller numbers of selected features, such as the Brain, CNS, and Lung datasets. On the other hand, the larger number of redundant and irrelevant features in the Breast Cancer dataset led to smaller improvements in the classification results of the machine learning techniques used in this study.
The results in Table 8 show that the proposed IG-GA, IGR-GA, and CS-GA methods achieved better performance than previous studies that applied filter-based and hybrid-based feature selection on most of the microarray datasets. Table 8 also shows that, although the proposed IG-GA, IGR-GA, and CS-GA methods performed well on the CNS dataset, they did not produce the best results there among the compared methods. However, as shown in Table 6, the proposed IG-GA, IGR-GA, and CS-GA methods still improved the classification results of the classifiers on the CNS dataset compared to the stand-alone classifiers and to the classifiers with the filter algorithms alone. Furthermore, as shown in Table 7, the proposed IG-GA, IGR-GA, and CS-GA methods were effective in reducing the number of features and eliminating irrelevant and redundant features in the CNS dataset.

Conclusions and Future Work
To overcome the difficulties arising from high-dimensional microarray datasets, this paper proposed a hybridization of filter feature selection methods with a GA-based feature selection method. In the first phase of the proposed hybrid filter-genetic feature selection approach, the top 5% of the features ranked by information gain, gain ratio, and Chi-squared are selected, while the other redundant and irrelevant features are eliminated to reduce the high-dimensional microarray datasets. The cancer classification performance of the machine learning techniques with only filter feature selection methods is not sufficiently accurate, since the features are evaluated based on certain criteria independently of a machine learning algorithm. Therefore, in the second phase of the proposed approach, the reduced datasets containing only the top-ranked features selected by the filter methods are further optimized by the GA to achieve better cancer classification results. The experimental results demonstrated that the GA in the proposed methods removed about 50% of the irrelevant features from the datasets reduced by the filter methods in the first step, and only the remaining important features were used to maximize the cancer classification performance of the classifiers. In addition, the proposed hybrid filter-GA feature selection methods achieved significantly better performance than the stand-alone classifiers or the classifiers with the filter algorithms alone. Furthermore, the proposed hybrid filter-GA approach outperformed other existing feature selection methods on most of the high-dimensional microarray datasets used in this study. Future work will focus on combining other filter feature selection methods with other evolutionary algorithms to further enhance cancer classification for high-dimensional microarray datasets.

Figure 1. The methodology of the hybrid filter genetic algorithm-based feature selection approach proposed for cancer classification in high-dimensional microarray datasets.


Algorithm 1: The pseudocode of the hybrid filter genetic algorithm-based feature selection approach
Input: F: original feature set; N: size of population (number of chromosomes)
Output: SF: the optimal selected features
1 Begin
2 Compute the score of each feature in F using information gain, gain ratio, or Chi-squared
3 RF = Select only the top 5% of ranked features
4 D = Dimension of RF
5 Initialize population P by generating N chromosomes C, each with D genes (features) and a random value in [0, 1] for each gene g
6 // Convert chromosomes to binary chromosomes (if the feature is selected, g = 1; otherwise, g = 0)
  If g >= 0.5 then g = 1; otherwise, g = 0
7 While termination criteria not met do
8   Compute the fitness value (classification accuracy) of each chromosome
9   Select two parents based on better fitness values
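The GA step outlined in Algorithm 1 can be sketched as a short, self-contained program. This is an illustrative implementation under simplifying assumptions, not the authors' exact code: it uses 3-fold cross-validated kNN accuracy as the fitness function, tournament selection, one-point crossover, and bit-flip mutation, and runs on toy random data standing in for a filter-reduced microarray matrix. All function names and parameter values here are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def ga_feature_select(X, y, pop_size=10, generations=5, cx_rate=0.9,
                      mut_rate=0.02, seed=0):
    """A minimal GA over binary feature masks, following Algorithm 1's outline."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Steps 5-6: random genes in [0, 1], thresholded at 0.5 into binary masks.
    pop = (rng.random((pop_size, d)) >= 0.5).astype(int)

    def fitness(mask):
        if mask.sum() == 0:                      # an empty subset is invalid
            return 0.0
        clf = KNeighborsClassifier(n_neighbors=3)
        return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

    for _ in range(generations):                 # step 7: fixed generation budget
        scores = np.array([fitness(c) for c in pop])
        new_pop = [pop[scores.argmax()].copy()]  # elitism: keep the best mask
        while len(new_pop) < pop_size:
            # Step 9: tournament selection of two parents with better fitness.
            parents = []
            for _ in range(2):
                i, j = rng.choice(pop_size, size=2, replace=False)
                parents.append(pop[i] if scores[i] >= scores[j] else pop[j])
            child = parents[0].copy()
            if rng.random() < cx_rate:           # one-point crossover
                cut = int(rng.integers(1, d))
                child[cut:] = parents[1][cut:]
            flip = rng.random(d) < mut_rate      # bit-flip mutation
            child[flip] = 1 - child[flip]
            new_pop.append(child)
        pop = np.array(new_pop)
    final_scores = np.array([fitness(c) for c in pop])
    return np.flatnonzero(pop[final_scores.argmax()])

# Toy data standing in for a filter-reduced microarray matrix.
rng = np.random.default_rng(1)
X = rng.random((40, 30))
y = rng.integers(0, 2, size=40)
selected = ga_feature_select(X, y)
print(len(selected), "of", X.shape[1], "features kept")
```

In a full experiment, the fitness function and the GA parameters (population size, crossover and mutation rates, termination criterion) would follow the settings listed in Table 2.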

Figure 2. The accuracy comparison of classifiers using all features, the features selected only by the filter methods, and the features selected by the proposed hybrid filter-GA methods on the Brain dataset.


Figure 3. The accuracy comparison of classifiers using all features, the features selected only by the filter methods, and the features selected by the proposed hybrid filter-GA methods on the Breast Cancer dataset.


Figure 4. The accuracy comparison of classifiers using all features, the features selected only by the filter methods, and the features selected by the proposed hybrid filter-GA methods on the Lung Cancer dataset.


Figure 5. The accuracy comparison of classifiers using all features, the features selected only by the filter methods, and the features selected by the proposed hybrid filter-GA methods on the CNS dataset.


Table 1. Description of the high-dimensional microarray datasets used in this study.

Table 2. Settings of the GA parameters used in the proposed hybrid filter-GA feature selection method on all experimental datasets.

Table 4. The performance comparison of classifiers using all features, the features selected by the filter methods, and the features selected by the proposed hybrid filter-GA methods on the Breast Cancer dataset.

Table 5. The performance comparison of classifiers using all features, the features selected by the filter methods, and the features selected by the proposed hybrid filter-GA methods on the Lung dataset.

Table 6. The performance comparison of classifiers using all features, the features selected by the filter methods, and the features selected by the proposed hybrid filter-GA methods on the CNS dataset.

Table 7. Comparison of the numbers of features selected before and after applying the proposed hybrid filter-GA feature selection method.

Table 8. The accuracy comparison of the proposed hybrid filter-genetic feature selection methods with the previous studies.