Optimization for Gene Selection and Cancer Classification

Gene selection plays an important role in cancer diagnosis and classification. This study selects highly descriptive genes from microarray data in order to improve classification for cancer diagnosis. To this end, we present a comparative analysis of six methods, obtained by pairing two feature selection algorithms with three search algorithms, together with the intersections of the feature subsets they select. Applying the six feature subset selection methods showed that attention can be focused on 24 genes instead of 15,155. Cancer diagnosis may therefore be possible with these 24 candidate genes rather than with the much larger feature sets used in similar studies. However, the diagnostic success of these candidate genes should be verified in a wet laboratory.


Introduction
The DNA microarray lets us study many genes simultaneously and provides information about the physiological processes and disease etiology mediated by these genes. Gene expression is regulated during the transcription of DNA into messenger ribonucleic acid (mRNA). DNA microarrays are a tool for identifying and measuring the mRNA transcripts found in cells [1].
Microarray datasets play an important role in cancer detection. However, their large size makes classification difficult, because they contain many irrelevant and unnecessary features. For this reason, feature (gene) selection is very important in this field, since it removes the features that are not required.
A microarray gene expression dataset contains information on the expression levels of genes in a particular tissue or cell. These data are a key source of information in many biological studies and analyses, and are therefore very useful for detecting tumors and cancer-related genes.
Microarray data generally include expression profiles of genes for both cancerous (tumor) and non-cancerous (normal) cells. Proper analysis will help the medical doctor and drug designer identify the genes responsible for cancers and take action before the disease becomes incurable. Therefore, microarray gene expression data are important because treatment becomes easier after detection [2].
Microarray data generally contain a small number of samples (around 100) and a large number of features (approximately 6000 to 60,000), which leads to the "curse of dimensionality" [3]. Most features in such data are unnecessary and/or irrelevant. Because expression values can indicate the occurrence of cancer, the most relevant features are called biomarkers, and finding them is an important research problem. Irrelevant features reduce the accuracy and increase the computation time of a cancer detection system. In short, not all genes are responsible for cancer; only a very small fraction of them are. This underlines the importance of feature selection, which eliminates irrelevant and/or unnecessary data from the dataset and makes detection faster and more accurate [2,4].
Different feature selection methods in the literature determine the optimal features differently, so applying them one by one gives different results. If we apply several methods separately and take the union or intersection of their results, we not only keep the most important information from all methods but also improve the chances of better prediction performance. The purpose of combining multiple feature selection methods is thus to exceed the maximum accuracy achieved with a single method: the combination provides complementary views on the importance of features and can compensate for the errors that individual methods make in different parts of the input space [2].
To this end, feature subsets were selected in this study with six methods, obtained by pairing two feature selection algorithms with three search algorithms; classification was performed with each subset and the results were examined. The intersections of the six feature subsets were also examined. The study uses the Ovarian cancer dataset produced by Zhu et al. (2007), which is accessible to researchers.
Ovarian cancer is one of the most common gynecological cancers with the highest mortality rate. It is the eighth most common cancer among women in the world and the 18th most common cancer in general [5].
Ovarian cancer is diagnosed at an advanced clinical stage in more than 80% of patients, and 5-year survival in this group is 35%. In contrast, 5-year survival exceeds 90% for patients with stage I ovarian cancer, and most of these patients are treated with surgery alone. Therefore, increasing the number of women diagnosed with stage I disease is expected to have a direct impact on the mortality and economic burden of this cancer without the need to change the approaches used in surgery or chemotherapy [6].
The American Cancer Society estimated that 21,750 new cases of ovarian cancer would be diagnosed in women in the United States in 2020 and that 13,940 women would die of the disease. A woman's lifetime risk of developing ovarian cancer was given as approximately 1 in 78, and her lifetime risk of dying from it as 1 in 108 [7].

Research Methodology
In this section, the dataset used in the study is described first. The data processing methods and algorithms we use are then discussed in detail.

Ovarian Dataset Description
The Ovarian cancer dataset (8-7-02) used in this research was produced by Zhu et al. (2007). Researchers can easily access this dataset from Reference [8]. The dataset consists of 15,154 genes (features), 253 observations and 2 classes. The observation group consists of 162 people with the disease and 91 healthy people. This dataset was produced using the WCX2 protein chip and is very different from the Ovarian cancer dataset (4-3-02).

Algorithms
In this research, six methods obtained by pairing two feature selection algorithms with three search algorithms were evaluated. The classification algorithms Support Vector Machine (SVM), Random Forest (RF) and Decision Tree (DT) were evaluated on the ovarian dataset. These algorithms are summarized briefly in Table 1.

Table 1. Algorithms and their descriptions.

Algorithm | Description

Feature Selection Algorithms
Consistency Based FS | Works on the principle of choosing a consistency-based feature subset [9]
Correlation Based FS | Sorts features based on a correlation-based evaluation function [10][11][12]

Search Algorithms
Genetic Search | Performs a search using the simple genetic algorithm described in [13]
Best First | Heuristic search method that searches the space of feature subsets with greedy hill climbing enriched with a backtracking facility [14]
Rank Search | Search method that works on a ranking (sorting) principle [15]

Classification Algorithms
Support Vector Machine | Performs classification with the help of a linear or nonlinear function [16]
Random Forest | A collection of tree-type classifiers based on the idea of using a forest for classification purposes [17]
Decision Tree | Creates a tree-shaped structure in order to make a decision [16,18]

Experimental Analysis
In this section, we discuss data preprocessing, classification stage and feature selection studies.

Data Preprocessing
The Ovarian cancer dataset used in this research was produced by Zhu et al. (2007). It was downloaded as an ARFF file, and the Weka libraries for Python were used to read and examine it. After the ARFF file was read, the data were converted to a Pandas DataFrame and examined. The feature types were then inspected, and only the feature named 'Class' was found to be categorical; all other features were numeric (see Table 2).
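To make the reading and type-checking step concrete, the following is a minimal sketch of a dense ARFF reader written with the Python standard library only. It is not the Weka wrapper used in the study, and it is simplified: every attribute that is not nominal is treated as numeric, and quoted names containing spaces are not handled. The attribute names f1 and f2 and the class labels Normal and Cancer are hypothetical.

```python
import io
from collections import Counter


def read_arff(handle):
    """Parse a dense ARFF stream into (attribute names, attribute types, rows)."""
    names, types, rows = [], [], []
    in_data = False
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([
                float(v) if t == "numeric" else v
                for v, t in zip(values, types)
            ])
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            names.append(name.strip("'\""))
            # Nominal attributes list their values in braces, e.g. {Normal,Cancer}.
            types.append("nominal" if kind.startswith("{") else "numeric")
        elif lower.startswith("@data"):
            in_data = True
    return names, types, rows


# Toy file; attribute names and class labels are hypothetical.
SAMPLE = """\
% toy ARFF file
@relation toy
@attribute f1 numeric
@attribute f2 numeric
@attribute Class {Normal,Cancer}
@data
0.10,0.50,Normal
0.90,0.20,Cancer
0.70,0.30,Cancer
"""

names, types, rows = read_arff(io.StringIO(SAMPLE))
categorical = [n for n, t in zip(names, types) if t == "nominal"]
class_counts = Counter(row[names.index("Class")] for row in rows)
```

On the real file, the same check of `categorical` would be expected to return only the 'Class' feature, as reported above.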
Examining the Class feature showed a total of 253 observations in two classes: 162 people with the disease and 91 healthy people (see Figure 1). The means, quartiles, and minimum and maximum values of the columns showed that every column was scaled to the range 0 to 1, but that the means and distributions differed. Descriptive statistics are given in Table 3. Extreme values were checked with box plots. Because the data are confined to the range 0 to 1, the samples at the limit values were flagged as inconsistent, but they were not excluded because the number of samples is small. This process is illustrated in Figure 2. Before classification, z-score normalization was applied in an attempt to obtain a relatively better distribution of the data. Finally, all the data were checked for missing records, and none were found.
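The z-score normalization mentioned above can be sketched as follows. This is an illustration only: it assumes the population standard deviation is used (the text does not say which variant was applied), and the input values are made up.

```python
from statistics import mean, pstdev


def z_score(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:  # a constant column carries no information
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical intensities already scaled to the range 0 to 1.
intensities = [0.0, 0.25, 0.5, 0.75, 1.0]
scaled = z_score(intensities)
```

After this step each column has mean 0 and standard deviation 1, so features with different means and spreads become comparable.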

Classification Stage
In the classification phase of our study, the full dataset was first classified with three algorithms, SVM, RF, and DT, using five-fold cross-validation. With RF on the Ovarian dataset, 98.8% classification accuracy and an F-score of 0.98809 were reached (see Table 4). With DT, 95.7% classification accuracy and an F-score of 0.957 were achieved (see Table 5), a lower classification performance than that of RF.
With SVM, 98.8% classification accuracy and an F-score of 0.98812 were achieved (see Table 6). These values are very similar to those obtained with the RF algorithm. A summary of the three classification studies can be seen in Table 7.
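For reference, the two reported metrics can be computed from true and predicted labels as in the sketch below. It assumes that the reported F-score is the class-weighted average of the per-class F1 values; the text does not state which averaging was used. The labels are toy data, not results from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of observations whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 score when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class frequency."""
    n = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, label) * y_true.count(label) / n
        for label in sorted(set(y_true))
    )


# Toy labels: one diseased sample is misclassified as healthy.
y_true = ["Cancer"] * 6 + ["Normal"] * 4
y_pred = ["Cancer"] * 5 + ["Normal"] + ["Normal"] * 4

acc = accuracy(y_true, y_pred)
f_score = weighted_f1(y_true, y_pred)
```

In a five-fold cross-validation, these values would be computed on the held-out fold each time and then combined across the five folds.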

Feature Selection Studies
The space of possible feature subsets for the Ovarian dataset has size $2^{15153}$, which is a 4562-digit number.
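The digit count can be checked without printing the number itself. The short sketch below uses the fact that a positive integer N has floor(log10(N)) + 1 decimal digits; the exponent 15153 is taken from the text.

```python
import math

exponent = 15153  # exponent of the subset space size, as stated in the text

# Number of decimal digits of 2**exponent: floor(exponent * log10(2)) + 1.
digits = math.floor(exponent * math.log10(2)) + 1
```

A search space of this size cannot be enumerated, which is why heuristic search algorithms are paired with the feature evaluators.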
On the Ovarian dataset, six feature subsets were selected by pairing each of the two feature selection algorithms (Correlation Based Feature Selection and Consistency Based Feature Selection) with each of the three search algorithms (Best First, Genetic Search and Rank Search), that is, by taking their Cartesian product. For Rank Search, the GainRatioAttributeEval algorithm, which Rank Search uses by default to determine ranking scores, was kept. These algorithms were called through the Weka library for Python.
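The pairing can be sketched as a Cartesian product. Only the abbreviations CfsGen and ConGen appear in the text of this paper; the other four labels generated below (and the prefixes Bf and Rank) are hypothetical stand-ins for the abbreviations listed in Table 8.

```python
from itertools import product

# Short prefix -> full name. "Cfs", "Con" and "Gen" follow the labels used in the
# text; "Bf" and "Rank" are hypothetical.
evaluators = {"Cfs": "Correlation Based FS", "Con": "Consistency Based FS"}
searches = {"Bf": "Best First", "Gen": "Genetic Search", "Rank": "Rank Search"}

# Every evaluator is paired with every search algorithm: 2 x 3 = 6 methods.
methods = {
    e + s: (evaluators[e], searches[s])
    for e, s in product(evaluators, searches)
}
```

Each of the six entries corresponds to one feature subset selection run on the dataset.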
The abbreviations given in Table 8 were created to facilitate analysis. As can be seen from Table 8, six feature selection applications were run on the Ovarian data, and the numbers of selected features are given in Table 9. Table 9 suggests that the genetic algorithm behaves more greedily, because it chooses considerably more features than the other methods. Classification was then carried out on the full data and on each of the data subsets produced by pairing the search and feature selection algorithms; the resulting values are given in Table 10. The results were then grouped by the datasets in Table 10, and Table 11 was obtained by averaging the accuracy and F-score values. The averages in Table 11 are ranked by accuracy (ACC).
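The grouping and averaging step that leads from Table 10 to Table 11 can be sketched as below. Only the all-feature results reported in the text are filled in; the rows for the six feature subsets are not reproduced here and would be appended in the same form. The dataset label "All features" is an assumed name.

```python
from collections import defaultdict
from statistics import mean

# (dataset, classifier, accuracy in %, F-score); values as reported in the text.
results = [
    ("All features", "RF", 98.8, 0.98809),
    ("All features", "DT", 95.7, 0.957),
    ("All features", "SVM", 98.8, 0.98812),
    # Rows for the six feature subsets (Table 10) would be appended here.
]


def average_by_dataset(rows):
    """Average accuracy and F-score per dataset, ranked by accuracy (descending)."""
    grouped = defaultdict(list)
    for dataset, _classifier, acc, f_score in rows:
        grouped[dataset].append((acc, f_score))
    averages = [
        (dataset, mean(a for a, _ in vals), mean(f for _, f in vals))
        for dataset, vals in grouped.items()
    ]
    return sorted(averages, key=lambda row: row[1], reverse=True)


table11 = average_by_dataset(results)
```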
When Table 11 is examined, classification with the data subsets from four of the feature selection methods achieved relatively higher accuracy and F-score values than classification with all data. With the other two data subsets, classification with fewer features gave results within an acceptable difference of those obtained with all features. At this stage, all feature subset selection methods can be said to have produced good results. The approaches using Genetic Search create subsets with relatively more features than the others. Although the data could be explained with fewer features, Genetic Search keeps more of them, which can be attributed to its greedy behavior on this data. When the averages of the results are plotted, very similar results are evident (see Figure 3, where ACC is shown in blue and F-score in orange). This shows that similar results can be achieved with far fewer features, so feature selection is beneficial for this classification study.
Table 12 was created to examine which applications selected each gene and how the selections intersect. In Table 12, 0 (zero) means that the relevant gene was not selected by that application, and 1 means that it was. The six feature selection applications selected a total of 5532 features, and some of these genes were selected by more than one application. The number of applications that selected the gene in each row was calculated and added to Table 12 as a column.
To examine the intersections, the graph in Figure 4 was produced with the upset function of the UpSetR library in R. In this graph, the set sizes are shown in the rows, the dots in the middle show which sets intersect, and the bars and numbers at the top show the number of elements in each intersection.
At this stage, the intersection table was filtered to keep the features that appear in at least three different sets; the result is shown in Table 13. It would therefore be beneficial to consider the genes corresponding to the 24 features in Table 13 in cancer diagnosis.
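The membership table, the per-gene selection count, and the "at least three" filter can be sketched with plain sets. The gene identifiers g1 to g5 and the subsets below are toy data, and all method labels other than CfsGen and ConGen are hypothetical.

```python
from collections import Counter

# Toy selections; the real subsets are those produced by the six methods.
selected = {
    "CfsBf": {"g1", "g2"},
    "CfsGen": {"g1", "g2", "g3", "g4"},
    "CfsRank": {"g1"},
    "ConBf": {"g2"},
    "ConGen": {"g1", "g3", "g5"},
    "ConRank": {"g1", "g2"},
}

# Number of methods that selected each gene (the extra column of the 0/1 table).
counts = Counter(gene for genes in selected.values() for gene in genes)

# 0/1 membership row per gene, one entry per method.
membership = {
    gene: {method: int(gene in genes) for method, genes in selected.items()}
    for gene in sorted(counts)
}

# Keep the genes selected by at least three of the methods.
frequent = sorted(gene for gene, n in counts.items() if n >= 3)
```

With these toy sets, g1 is chosen by five methods and g2 by four, so both pass the filter, while g3, g4 and g5 do not.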
Figure 4 shows that CfsGen (Correlation Based FS and Genetic Search) selected 3142 genes that no other method selected. Similarly, 580 genes were selected by both ConGen and CfsGen and by no other method.
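The counts quoted above are exclusive intersections: each gene is counted once, under the exact combination of methods that selected it. The sketch below shows that calculation in Python on toy data; it is not the UpSetR call used for Figure 4, and the gene names and the label CfsBf are hypothetical.

```python
from collections import Counter


def exclusive_intersections(selected):
    """Count genes by the exact set of methods that selected them."""
    combos = Counter()
    all_genes = set().union(*selected.values())
    for gene in all_genes:
        combo = frozenset(m for m, genes in selected.items() if gene in genes)
        combos[combo] += 1
    return combos


# Toy selections, not the study's subsets.
toy = {
    "CfsGen": {"a", "b", "c", "d"},
    "ConGen": {"c", "d", "e"},
    "CfsBf": {"d"},
}

sizes = exclusive_intersections(toy)
only_cfsgen = sizes[frozenset({"CfsGen"})]
cfsgen_and_congen_only = sizes[frozenset({"CfsGen", "ConGen"})]
```

Because every gene falls into exactly one combination, the exclusive counts add up to the total number of distinct selected genes.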

Discussion and Results
This study examined the selection of highly descriptive genes with feature selection methods, in order to improve classification for cancer diagnosis from microarray data. Studies in the literature have evaluated the intersections of different selected feature subsets. In this study, too, the intersections of the feature subsets selected by six different methods were examined. However, because the data used are not large enough to represent the entire human population, the study proposes an approach, and its results cannot support definitive judgments in terms of genetics.
There are many studies and approaches to gene selection for cancer detection in the literature. Our approach is to examine how frequently genes are selected across different feature selection methods, on the premise that more frequently selected genes may be more descriptive of cancer.
As a result, it has been shown that, instead of trying to predict ovarian cancer from 15,155 genes, it can be predicted with the 24 genes selected by at least three of the six feature subset selection methods. Thanks to this reduction in the number of genes, cancer detection may be possible with less microarray data than in similar studies with larger feature sets, and workforce and cost requirements can be reduced by working only with the relevant genes in subsequent diagnostic stages. In addition, higher diagnostic success may be achievable by excluding variables with low explanatory value from the study. However, diagnoses made using these candidate genes need to be examined in a wet laboratory to confirm their diagnostic success.
Within the scope of future studies, it may be possible to make a wider range of evaluation and gene selection by using different feature subset selection methods and classification algorithms.
Author Contributions: H.B. prepared the model experiments, interpreted the results, and prepared the manuscript. E.S. contributed to the formal analysis and software. Ç.S.E. is the research advisor and provided intuitive explanations for the manuscript. All authors have read and agreed to the published version of the manuscript.