Feature Selection for Colon Cancer Detection Using K-Means Clustering and Modiﬁed Harmony Search Algorithm

Abstract: This paper proposes a feature selection method that is effective in distinguishing colorectal cancer patients from normal individuals using K-means clustering and the modified harmony search algorithm. As the genetic cause of colorectal cancer originates from mutations in genes, it is important to classify the presence or absence of colorectal cancer through gene information. The proposed methodology consists of four steps. First, the original data are Z-normalized by data preprocessing. Candidate genes are then selected using the Fisher score. Next, one representative gene is selected from each cluster after candidate genes are clustered using K-means clustering. Finally, feature selection is carried out using the modified harmony search algorithm. The gene combination created by feature selection is then applied to the classification model and verified using 5-fold cross-validation. The proposed model obtained a classification accuracy of up to 94.36%. Furthermore, on comparing the proposed method with other methods, we show that the proposed method performs well in classifying colorectal cancer. Moreover, we believe that the proposed model can be applied not only to colorectal cancer but also to other gene-related diseases.

Author Contributions: Conceptualization, J.H.B.; methodology, J.H.B. and M.K.; software and experiments, J.H.B. and M.K.; writing—original draft preparation, J.H.B.; writing—review and editing, J.H.B. and Z.W.G.; supervision, J.S.L. and Z.W.G.; funding acquisition, Z.W.G.


Introduction
Colorectal cancer (CRC) is the third most common cause of cancer mortality and accounts for 11% of all cancer diagnoses worldwide [1,2]. By gender, CRC is the third most common cancer among men and the second most common among women [3]. Furthermore, as the incidence rate among young people is gradually increasing, the average age of people with CRC is decreasing. The average age at CRC diagnosis in the United States was 72 years between 2001 and 2002, which decreased to 66 years between 2015 and 2016 [4]. Therefore, early diagnosis of CRC is becoming increasingly important.
The major causes of CRC are smoking, obesity, and poor lifestyle and eating habits, all of which are acquired factors. It has been statistically shown that the risk for CRC is higher in developed countries [5]. Excessive consumption of animal fat and meat, especially red meat, acts as a risk factor for CRC. Nevertheless, cancer incidence due to various acquired or environmental factors can be substantially reduced by changing lifestyle patterns.
Meanwhile, genetic factors account for 10-30% of all CRC cases. However, for individuals with these genetic factors, the risk of developing CRC is significantly higher than for those with only acquired factors. Representative examples include familial adenomatous polyposis (FAP) and hereditary non-polyposis colorectal cancer (HNPCC). FAP causes hundreds to thousands of adenomas to develop on the wall of the colon, and almost 100% of patients develop cancer in adulthood. Considering that 95% of patients develop cancer before 45 years of age, prevention through early diagnosis is necessary. HNPCC develops at an early age and is more common than FAP, and the risk of CRC in immediate family members is 2-3 times higher [6]. Therefore, it is important to identify, through testing, the genes involved in the development of CRC.

Related Work
DNA information is an important factor in predicting genetic diseases. However, diagnosis can be difficult in unpredictable situations due to the large amount of data or genetic mutations. In recent years, with the progress made in the field of artificial intelligence, research on predicting diseases using only biological data has been actively conducted. Several studies have predicted CRC using the information on CRC genes published by the Princeton University Gene Expression Project.
In one such study, data were analyzed with random ensembles, and a support vector machine (SVM) was used as a classifier to predict CRC based on cancer gene information [12]. The authors created a random ensemble application using a new C++ class and the NEURObjects library [13].

Materials and Methods
Of the 6500 human genes provided in [15], the expression levels of 40 tumor and 22 normal colon tissue samples were used. In this study, we used the information of the 2000 genes with the highest minimum intensity across all samples. We attempted to classify CRC using the information on these 2000 CRC genes from 62 people provided by the Princeton University Gene Expression Project. All data used in the experiment are either 3′ UTRs or genes; the 3′ UTR strictly controls gene expression in normal cells [16]. Figure 1 represents the step-by-step process proposed in this paper. The parameters and their corresponding values used in the experiment are described in the description of each step. First, the data are normalized using Z-normalization [17]. Candidate genes are then selected using the Fisher score on the normalized data. Next, the candidate genes are clustered using K-means clustering, after which representative genes to be used for CRC prediction are selected within each cluster. Finally, multiple gene combinations of the selected representative genes are searched using the harmony search (HS) method. The combination obtained is then verified using 5-fold cross-validation.
• Z-normalization
Normalization plays a role in reflecting all data values with the same degree of importance. The formula for Z-normalization is given in (1), where x is the original value, µ is the mean of the data, and σ is the standard deviation of the data:

z = (x − µ) / σ. (1)

If the extracted value matches the mean of the genetic information, it is normalized to zero. If the extracted value is less than the mean, it is normalized to a negative number, and if it is greater than the mean, to a positive number. The magnitude of the normalized value is determined by the standard deviation of the genetic information: if the range of the data values is large, that is, if the standard deviation is large, the normalized value approaches 0.
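As a concrete illustration, the Z-normalization of Equation (1) can be sketched in a few lines of NumPy. The expression values below are illustrative only, not taken from the colon cancer dataset.

```python
import numpy as np

def z_normalize(values):
    """Z-normalize a 1-D array of expression values: z = (x - mu) / sigma (Eq. 1)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Illustrative expression values of one attribute across five patients
expr = np.array([8589.42, 9164.25, 3825.71, 6246.45, 3230.33])
z = z_normalize(expr)
# Values above the attribute mean map to positive z, values below it to negative z
```

Applied column by column, this turns every attribute into a zero-mean, unit-variance variable, so no single gene dominates the later distance computations.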
We normalized the data by substituting the original genetic information values into (1). Table 1 shows the Z-normalized values of Attribute 1, one of the genetic information values, for each patient. The number of patients included in the actual experiment was 62, but Table 1 shows, as an example, the value of Attribute 1 for only 12 patients. The mean of the Attribute 1 gene information of the 62 people was 7015.78671. This mean is subtracted from each patient's genetic information value, and the result is divided by the standard deviation of the gene information, 3092.970584. The normalized numbers obtained in this way are listed in Table 1; the same procedure applies to all data.
• Fisher Score
The number of combinations to consider when selecting a small number of genes that distinguish colorectal cancer from among 2000 genetic information values is near infinite. The main purpose of this step is to select candidate genes that are easy to classify using the Fisher score. This process reduces the number of combinations and serves as the basis for the selection of representative genes. It also reduces redundancy among genes with similar characteristics and reduces the time complexity of the experiments. Fisher scoring is a variant of Newton's method used for maximum likelihood estimation in statistics [18]. The score is computed as in (2), where X_i^A and X_i^B indicate the average gene information value of gene i for a normal person and a person with a cancer gene, respectively, and σ_i^A and σ_i^B indicate the corresponding standard deviations:

F(i) = (X_i^A − X_i^B)² / ((σ_i^A)² + (σ_i^B)²). (2)
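To make the scoring step concrete, the following NumPy sketch computes the Fisher score of Equation (2) for every gene at once. The matrix and labels are toy values, not the colon data.

```python
import numpy as np

def fisher_score(X, y):
    """Per-gene Fisher score (Eq. 2): squared difference of the class means
    divided by the sum of the class variances.
    X: (samples, genes) expression matrix; y: 0/1 class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    A, B = X[y == 0], X[y == 1]
    num = (A.mean(axis=0) - B.mean(axis=0)) ** 2
    den = A.std(axis=0) ** 2 + B.std(axis=0) ** 2
    return num / den

# Toy data: gene 0 separates the two classes, gene 1 does not (illustrative values)
X = np.array([[1.0, 5.0],
              [1.2, 4.0],
              [5.0, 5.1],
              [5.2, 4.2]])
y = np.array([0, 0, 1, 1])
scores = fisher_score(X, y)
ranked = np.argsort(scores)[::-1]  # candidate genes, highest score first
```

In the paper, the 1000 highest-scoring of the 2000 genes are kept as candidates; here that would correspond to taking a prefix of `ranked`.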
Here, the data of gene i refer to the gene information values of normal people and of patients with CRC. As the original data are already labeled for classification, the level of expression of gene i relevant to classification can be evaluated. As the Fisher score of a gene increases, the difference between the distributions of the two classes also increases. Therefore, we selected the top 1000 genes, in descending order of Fisher score, as candidate genes to be used in the next feature selection step.
• K-means Clustering
We used K-means clustering, an unsupervised learning method, to find representative genes among the 1000 candidate genes selected using the Fisher score. In K-means clustering, clusters are formed around the nearest centroid, that is, the mean of a group; thus, K-means clustering is carried out using the average of the data. When n data points (x_1, x_2, ..., x_n) are divided into k clusters, the objective can be expressed as in (3):

C = Σ_n Σ_k r_nk ‖x_n − U_k‖². (3)

For each cluster, the squared distances from its data points to the cluster mean are summed, and the assignments must be chosen so that C becomes minimal. U_k is the center of the kth cluster; the first U_k is an arbitrary initial value. After fixing the U_k values, the r_nk values that minimize C are found: when x_n belongs to the kth cluster, r_nk is 1; otherwise, it is 0. Once the r_nk values are obtained, they are fixed and U_k is recomputed. This process is repeated a predetermined number of times or until further iterations no longer improve the result.
In this study, the number of clusters was set to 20. Each cluster consists of candidate genes grouped for the classification of CRC. From the 1000 candidate genes, 20 representative genes were selected to account for diversity. In each cluster, the gene whose information data were closest to the median of the cluster was designated as the representative gene of that cluster. We used the cosine distance to calculate the distance between the data and the median. The cosine distance between vectors u and v is calculated as in (4):

d(u, v) = 1 − (u · v) / (‖u‖ ‖v‖). (4)

We computed the cosine distance using the scipy.spatial.distance library.
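The representative-gene rule described above can be sketched as follows. The cosine distance mirrors scipy.spatial.distance.cosine, and the small cluster of gene profiles is illustrative, not real data.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance (Eq. 4): 1 minus the cosine of the angle between u and v
    (equivalent to scipy.spatial.distance.cosine)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def representative_gene(cluster_rows, gene_ids):
    """Pick, from one cluster, the gene whose expression profile is closest
    (in cosine distance) to the element-wise median of the cluster."""
    cluster_rows = np.asarray(cluster_rows, dtype=float)
    median = np.median(cluster_rows, axis=0)
    dists = [cosine_distance(row, median) for row in cluster_rows]
    return gene_ids[int(np.argmin(dists))]

# Illustrative cluster: three gene profiles measured on four patients;
# the third profile points in a clearly different direction
rows = np.array([[1.0, 2.0, 3.0, 4.0],
                 [1.1, 2.1, 2.9, 4.2],
                 [4.0, 1.0, 0.5, 0.1]])
rep = representative_gene(rows, ["attribute_a", "attribute_b", "attribute_c"])
```

Running this for each of the 20 clusters yields the 20 representative genes fed into the HS feature selection step.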
The HS algorithm is an evolutionary computation algorithm inspired by the process of musicians improvising a harmony. HS has been applied to research using biological data. Hickmann et al. conducted weekly predictions of seasonal influenza based on Wikipedia access and CDC influenza-like illness (ILI) reports [19]; they formed 50% and 95% confidence intervals for the 2013-2014 ILI observations. Using the HSWOA method, which combines HS with the whale optimization algorithm (WOA), a study was conducted on the accuracy of hybridization reactions in DNA sequence design [20]. A comparative analysis was conducted with NACST/Seq [21], DEPT [22], H-MO-TLBO [23], and MO-ABC [24], and the average fitness of HSWOA was higher than that of the four algorithms. Additionally, the COA-HS algorithm combines HS with cancer gene selection [25]; that algorithm seeks to overcome the curse of dimensionality and is aimed at selecting meaningful genes. There is also a study proposing a metaheuristic HS algorithm that effectively predicts the structure of RNA as well as DNA [26]. HS has also been applied to reducing hand tremors in Parkinson's disease rehabilitation and the intensity of magnetic fields transmitted to the brain [27]. In this study, the existing HS process was modified and used as a feature selection method. The existing HS algorithm involves a total of four steps.
Step 1. Initializing parameters and harmony memory The first step is to initialize the variables and harmony so as to implement the harmony memory. To use this algorithm, we need to know the meaning of the parameters. As HS is an evolutionary algorithm, it can be compared to a genetic algorithm. The genes, which are basic elements of the chromosome in the genetic algorithm, are the same as the musical tones, which are the basic elements of a harmony vector. Harmony memory size (HMS) refers to the number of harmonies in one harmony memory. Harmony vectors are randomly initialized at the start of the HS method implementation, and previous harmony values are used when an iteration is performed later.
Step 2. Creating a new harmony This is the stage where the combination ratio can be adjusted to create a new harmony and obtain a wide range of combinations. A group of as many harmonies as the HMS is created within one harmony memory. For each variable position, a value is randomly selected from the harmony vectors at the same position in the harmony memory, and the selected value becomes the value of the new harmony vector at the corresponding position. The new values gathered at each position form a new harmony. The harmony memory considering rate (HMCR) is the probability of creating a new harmony value in this way; 1 − HMCR is the probability of randomly initializing the value instead, as when creating the first harmony. The new harmony is then added to the harmony memory. The pitch adjusting rate (PAR) is the probability of applying a variation to the harmony vector, which yields a more diverse set of combinations.
Step 3. Updating harmony memory In this step, the newly generated harmony vector is evaluated. The quality of the harmony is tested based on its objective function value (fitness value). If the new harmony vector generated in Step 2 has a better objective function value than the worst one in the harmony memory, the new vector is included in the harmony memory and the worst one is removed.
Steps 2 and 3 are repeated for the specified number of iterations. With each iteration, the harmony with the lowest fitness is removed, and thus various combinations are generated from harmonies of high fitness.
However, we propose a new method of feature selection by modifying the existing HS. The related pseudocode is shown in Algorithm 1.

Algorithm 1 Modified Harmony Search
1. Initialize parameters (HMS, HMCR, PAR, itr) and the boundary dividing the two areas (BDR)
2. For i = 1 to HMS
3.   Generate initial harmony
4. End for
5. Repeat
6.   For j = 1 to number of variables
7.     x_new_j = Randomly select from x_1j to x_(BDR)j
8.   End for
9.   Generate new harmony (x_new) for the upper area
10.  For j = 1 to number of variables
11.    x_new_j = Randomly select from x_(BDR+1)j to x_(HMS)j, applying HMCR and PAR
12.  End for
13.  Generate new harmony (x_new) for the lower area
14.  Update harmony memory (remove the two worst harmonies)
15. Until itr iterations are reached

Step 1. Initializing variables and harmony To create a combination with the 20 representative genes, the harmony vector is first initialized with 0s and 1s. Zero means that the representative gene information value at that index is not used as a feature for classification, and 1 means that it is used as a feature for classification. HMCR is 0.9, PAR is 0.1, and the number of iterations (itr) is 500. HMS is set to 30.
Step 2. Creating new harmony and dividing harmony memory This step is the part of the existing HS modified for this study. The process of creating a new harmony is the same as in the existing HS algorithm, but the experiment was conducted by dividing the harmony memory into two areas, as shown in Figure 2. The upper area is composed of the harmonies having the top 20% fitness within one harmony memory. HMCR and PAR are not used for this area, so no new harmony is added by random initialization. Rather than creating diversity of combinations, recombining within the harmonies of the upper area can find combinations of higher fitness, and new harmonies are created in this way. In the second area, the lower area of the harmony memory, new harmonies are created using the existing HS algorithm, that is, by using HMCR and PAR.
Step 3. Updating harmony memory The fitness is the classification accuracy obtained by applying the classification model used in this paper to the combination selected from the harmony. The fitness is calculated for each harmony, and the harmonies are arranged in order of decreasing fitness. As two new harmonies are created in Step 2, the two old harmonies with the lowest fitness, aligned as shown in Figure 3, are removed to match the HMS that was initially specified.
Step 4. Repeating Steps 2 and 3 There is no newly modified process at this stage. Steps 2 and 3 are repeated for the specified number of iterations. As the number of repetitions increases, the upper area finds harmonies of higher fitness within already highly fit combinations, whereas the lower area maintains the advantage of the existing HS, that is, finding combinations through diversity. At each iteration, the highest classification accuracy of the two areas within one harmony memory is stored in a text file, and the change in accuracy as the iterations progress is confirmed.
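A minimal sketch of this modified HS loop is given below. The fitness function here is a stand-in toy objective (the paper evaluates each harmony by the cross-validated accuracy of an ANN), and names such as ELITE and TARGET are illustrative assumptions.

```python
import random

HMS, HMCR, PAR, ITR = 30, 0.9, 0.1, 500   # parameter values from the paper
N_GENES = 20                               # one bit per representative gene
ELITE = int(HMS * 0.2)                     # upper area: top 20% of the memory

# Stand-in objective: reward matching a hidden target subset so the sketch
# runs without the gene data; the paper uses ANN classification accuracy.
TARGET = [1 if j % 3 == 0 else 0 for j in range(N_GENES)]

def fitness(harmony):
    return sum(h == t for h, t in zip(harmony, TARGET))

def new_harmony(memory, lo, hi, use_hmcr):
    """Build a new binary harmony from memory rows lo..hi-1 (one area)."""
    h = []
    for j in range(N_GENES):
        if not use_hmcr or random.random() < HMCR:
            h.append(random.choice(memory[lo:hi])[j])  # recombine within the area
            if use_hmcr and random.random() < PAR:
                h[j] = 1 - h[j]                        # pitch adjustment: flip the bit
        else:
            h.append(random.randint(0, 1))             # random re-initialization
    return h

random.seed(0)
memory = [[random.randint(0, 1) for _ in range(N_GENES)] for _ in range(HMS)]
memory.sort(key=fitness, reverse=True)

for _ in range(ITR):
    upper = new_harmony(memory, 0, ELITE, use_hmcr=False)   # elite recombination only
    lower = new_harmony(memory, ELITE, HMS, use_hmcr=True)  # classic HS with HMCR/PAR
    memory += [upper, lower]
    memory.sort(key=fitness, reverse=True)
    memory = memory[:HMS]                                    # drop the two worst harmonies

best = memory[0]  # bit mask of the selected genes
```

Each 1 in `best` marks a representative gene kept as a feature; with the real objective, `fitness` would train and cross-validate the classifier on exactly those genes.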

Classification and Validation
We used an artificial neural network (ANN) as a classifier [28]. An ANN is a network created by abstracting neurons in the brain. Figure 4 shows the structure of the ANN used in our study. The input and the hidden layers are composed of five nodes. The output layer consists of one node, and the sigmoid function is used as the activation function.
We used K-fold cross-validation as an experimental verification technique [29]. All data were used as a test set at least once to increase the reliability of data verification. Figure 5 shows the process of training and testing data divided using 5-fold cross-validation. Furthermore, the combination of features selected by HS was verified through 5-fold cross-validation.
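The 5-fold splitting described above can be sketched as follows; the classifier itself is omitted, and the 62-sample count matches the dataset used in the paper.

```python
import numpy as np

def five_fold_indices(n_samples, seed=0):
    """Yield (train, test) index arrays; each fold serves exactly once as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, 5)
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[i] for i in range(5) if i != k])
        yield train, test

# 62 patients, as in the colon cancer dataset
splits = list(five_fold_indices(62))
# Every patient lands in a test fold exactly once across the five splits
```

For each split, the ANN would be trained on `train` and scored on `test`; the mean of the five accuracies is the reported cross-validated accuracy.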

Results
A total of 1000 candidate genes selected out of the 2000 genes through the Fisher score were divided into 20 clusters using K-means clustering. The optimal number of clusters was determined using the inertia value in scikit-learn. Figure 6 shows the inertia value according to the number of clusters. The lower the inertia value, the closer the values inside a cluster are to its centroid; that is, the smaller the inertia value, the higher the degree of aggregation of the data in the cluster. However, too many clusters can confuse the classification.
The representative genes selected through the cosine distance from the 20 clusters are as follows: attribute357, attribute457, attribute750, attribute722, attribute1635, attribute982, attribute936, attribute1897, attribute1515, attribute316, attribute1069, attribute1170, attribute158, attribute737, attribute640, attribute482, attribute109, attribute980, attribute43, and attribute1244. Table 2 summarizes the gene information of the 62 gene values for the 20 representative genes. All values in Table 2 are displayed to four decimal places. Each row represents a gene value according to a patient's attribute, and each column represents a patient's gene information value for each attribute.
We selected eight genes from the 20 representative genes using the HS feature selection method. The selected genes were attribute43 (ribosomal protein; Nicotiana tabacum), attribute737 (monoamine oxidase B), attribute936 (proteasome component), attribute1170 (GST1-Hs mRNA for GTP-binding protein), attribute1244 (mRNA for upstream binding factor), attribute1515 (grancalcin mRNA), attribute1635 (vasoactive intestinal peptide mRNA), and attribute1897 (zinc finger protein mRNA), and the classification accuracy obtained using the ANN was 93.46%. Each attribute is closely related to CRC or cancer, and the evidence for this is supported by several studies [30-36].
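The inertia-versus-k inspection used to pick the number of clusters can be reproduced with a small Lloyd's-algorithm sketch; the two-dimensional toy data below stand in for the gene vectors, and `kmeans_inertia` is an illustrative helper rather than the scikit-learn implementation.

```python
import numpy as np

def kmeans_inertia(X, k, iters=50, seed=0):
    """Run Lloyd's algorithm and return the final inertia: the sum of squared
    distances of each point to its assigned centroid (what scikit-learn reports)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)  # move centroid to cluster mean
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

# Toy data: three well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in (0.0, 5.0, 10.0)])
inertias = [kmeans_inertia(X, k) for k in range(1, 7)]
```

Plotting `inertias` against k reproduces the elbow-style curve of Figure 6: inertia drops sharply until the cluster count matches the data's structure, then flattens.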

Comparison with Other Methods
Many researchers have experimented with various classification algorithms using the colon cancer data provided by the Princeton University Gene Expression Project. Table 3 shows the number of genes selected in the present study in relation to other studies and the corresponding classification accuracies. As a range of accuracies can cause ambiguity in comparison, the representative accuracy of each research paper is shown. Some comparative papers perform classification without using feature selection. There are studies that have used the random forest (RF) algorithm [37], support vector machine (SVM) models [13], two-way clustering [38], and LogitBoost with 10-fold cross-validation [39] on the data provided by the Princeton University Gene Expression Project. In addition, there are studies that derive classification accuracy through feature selection using the Chameleon algorithm [40] and supervised group Lasso [41].

The proposed method achieved the highest accuracy compared with the other studies, regardless of whether feature selection was used. The Chameleon algorithm selected the fewest features among the comparative studies; however, our proposed method achieved better accuracy than the Chameleon algorithm (93.46% vs. 85.48%).

Conclusions and Future Works
In this study, in order to classify CRC using gene information, we proposed a hybrid method that normalizes gene information values using Z-normalization, reduces redundant genes using the Fisher score, selects representative genes using K-means clustering, and performs feature selection using the HS algorithm. In K-means clustering, selecting representative genes using the cosine distance is straightforward and effective. The feature selection method modified from the original HS algorithm maintains high accuracy and improves classification performance by applying various combinations to the model. The experimental results showed a classification performance of 93.46% with only eight genes selected using the proposed method: attribute1635, attribute936, attribute1897, attribute1515, attribute1170, attribute737, attribute43, and attribute1244. This can lead to cost savings owing to fewer genetic tests. In addition, the results of the present study can contribute greatly to the prediction of not only the CRC gene but also various other disease-causing genes. For example, hereditary breast or ovarian cancer can also be predicted through genetic testing using the proposed method [42,43]. It is important to confirm the likelihood of a cancer gene through genetic testing for people with a family history of cancer-related diseases or for people who are likely to develop cancer. Therefore, research to predict cancer by finding a small number of genes associated with gene mutations will be actively conducted in the future. There are also possibilities for conducting experiments in different ways.
For example, the genetic data used in our paper can be analyzed using other methods, including single-particle tracking experiments. Additionally, our proposed methods can be applied to cancer-tracking time series data or to non-genetic data (diet, smoking, or exercise) as well as genetic data to increase the objectivity and suitability of our model and data [44,45].