A New Algorithm for Cancer Biomarker Gene Detection Using Harris Hawks Optimization

This paper presents two novel swarm intelligence algorithms for gene selection, HHO-SVM and HHO-KNN. Both are based on Harris Hawks Optimization (HHO), one in conjunction with support vector machines (SVM) and the other with k-nearest neighbors (k-NN). In both algorithms, the goal is to determine a small gene subset that classifies samples with high accuracy. The proposed algorithms proceed in two phases. To obtain an accurate gene set and cope with the challenge of high-dimensional data, relevance calculation and redundancy analysis are conducted in the first phase. To solve the gene selection problem, the second phase applies SVM and k-NN with leave-one-out cross-validation. The two proposed algorithms were evaluated on six microarray data sets. A comparison with several known algorithms indicates that both perform well in terms of classification accuracy and the number of selected genes.


Introduction
Approximately 10 million people worldwide die from cancer every year, or one in every six deaths, according to the WHO [1]. Early diagnosis and treatment can reduce the cancer mortality rate, while wrong classifications and predictions cause serious harm to patients and their families [2]. Microarray data are commonly employed in cancer research, where early detection of cancer can greatly influence treatment and survival rates [3]. Nevertheless, microarray data suffer from high dimensionality, since the number of genes far outnumbers the number of samples, leading to the so-called "curse of dimensionality": when the dimensionality of a data set rises significantly, it becomes difficult to demonstrate the statistical significance of the results [4].
Four approaches have been proposed for addressing the "curse of dimensionality": filter, wrapper, embedded, and hybrid methods [5]. A filter method scores the relevance of each feature based only on intrinsic properties of the data; features can then be sorted by score and low-scoring features removed. In wrapper methods, the analysis model is embedded within the search for appropriate features. Embedded methods search for an optimal subset of features as part of the analysis algorithm itself. Hybrid methods combine two selection methods to take advantage of both [6].
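To make the filter idea concrete, here is a minimal sketch (the scoring rule is an illustrative choice of ours, not one used in this paper) that scores each gene independently and keeps the top k:

```python
import numpy as np

def filter_select(X, y, k):
    """Filter-style gene selection: score each gene independently by the
    absolute difference of its class means, then keep the k highest-scoring."""
    mu0 = X[y == 0].mean(axis=0)   # per-gene mean in class 0
    mu1 = X[y == 1].mean(axis=0)   # per-gene mean in class 1
    scores = np.abs(mu0 - mu1)
    return np.argsort(scores)[::-1][:k]   # indices of the top-k genes
```

A wrapper method, by contrast, would re-train the classifier for every candidate subset, which is what the HHO-based approaches below do.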
Two wrapper-based feature-selection methods are presented in this paper, both of which employ the Harris Hawks Optimizer (HHO) to select the most informative genes for classification and achieve high accuracy: HHO-SVM works in conjunction with support vector machines (SVM), and HHO-KNN works in conjunction with the k-nearest neighbors (k-NN) algorithm. To evaluate the effectiveness of HHO-SVM and HHO-KNN, we compared their results on six microarray cancer data sets against several recently published techniques. In both binary and multiclass classification, HHO-KNN and HHO-SVM achieve higher classification accuracy with fewer selected genes. This paper aims to test HHO as a feature-selection method and assess its effectiveness by answering two questions: Can HHO be used as a feature-selection method on well-known cancer gene microarray data sets? And which classifier works best with HHO?
The paper is structured as follows: Section 2 describes how HHO was inspired and the mathematical modeling that went into it. In Section 3, we introduce our proposed HHO-SVM and HHO-KNN approaches to gene selection. Discussions and experimental results are presented in Section 4. Finally, the conclusion is given in Section 5.

Inspiration
The Harris Hawks Optimizer (HHO) is a swarm computation method that was developed by Heidari et al. in 2019 [7]. This algorithm was inspired by the cooperative hunting and chasing behavior exhibited by Harris's hawks, particularly "surprise pounces" or "the seven kills." In a cooperative attack, numerous hawks coordinate their efforts and simultaneously attack a rabbit that has shown itself.
The attack may be accomplished quickly by catching the surprised prey in a matter of seconds; however, depending on the prey's actions and ability to flee, it may instead consist of repeated, short, fast dives near the prey over the course of many minutes. According to the changing circumstances and the prey's escape patterns, Harris's hawks can demonstrate a variety of chasing styles. Tactics are commonly switched when the party's strongest hawk (the leader) goes after the prey but loses it, at which point another party member continues the chase. These switches are observed in a variety of settings because they confuse the escaping rabbit. Moreover, the rabbit cannot regain its defensive abilities when a new hawk begins the chase, and it cannot escape the attacking team: eventually one hawk, usually the most experienced and powerful, captures the exhausted rabbit and shares it with the rest of the team.

Mathematical Modeling
Hawks chase their prey by tracing, encircling, and eventually striking and killing it. The mathematical model, based on the hawks' hunting behavior, comprises three stages: exploration, transition between exploration and exploitation, and exploitation. At each stage of the hunt, the Harris's hawks are the candidate solutions, and the targeted prey is the best candidate solution (nearly the optimum).
As they search for prey, Harris's hawks use two different exploration techniques. The candidate solutions are designed to move as close to the prey as possible, and the best candidate is the intended prey. In the first technique, a hawk chooses a spot by considering the locations of the other hawks and the prey; in the second, the hawks wait on random tall trees. The two techniques are simulated with equal probability q, as shown in Equation (1):

x(t + 1) = x_rand(t) − r1 |x_rand(t) − 2 r2 x(t)|,  if q ≥ 0.5
x(t + 1) = (x_rabbit(t) − x_mean(t)) − r3 (LB + r4 (UB − LB)),  if q < 0.5    (1)

• x(t) is the current hawk position, and x(t + 1) is the hawk's position at the next iteration.
• x_rand(t) is a hawk selected at random from the population.
• x_rabbit(t) is the rabbit (prey) position.
• q, r1, r2, r3, and r4 are random numbers inside (0, 1).
• LB and UB are the lower and upper bounds of the variables.
• x_mean(t) is the average position of the current population of hawks, calculated as shown in Equation (2):

x_mean(t) = (1/N) Σ_{i=1}^{N} x_i(t)    (2)

where N is the total number of hawks.
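The exploration phase of Equations (1) and (2) can be sketched in NumPy as follows (a minimal illustration assuming real-valued position vectors; function and variable names are ours):

```python
import numpy as np

def exploration_step(X, i, x_rabbit, lb, ub, rng):
    """One exploration update for hawk i, per Equation (1); the population
    mean x_mean follows Equation (2)."""
    q, r1, r2, r3, r4 = rng.random(5)
    if q >= 0.5:
        # Perch based on a randomly selected hawk's position
        x_rand = X[rng.integers(len(X))]
        return x_rand - r1 * np.abs(x_rand - 2.0 * r2 * X[i])
    # Perch on a random tall tree inside the group's home range
    x_mean = X.mean(axis=0)
    return (x_rabbit - x_mean) - r3 * (lb + r4 * (ub - lb))
```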
The algorithm switches from exploration to exploitation depending on the rabbit's running or escaping energy, as shown in Equation (3):

E = 2 E_0 (1 − t/T)    (3)

• E represents the prey's escaping energy.
• E_0 is the initial energy, which changes randomly inside (−1, 1) at each iteration.
• t is the current iteration, and T is the maximum number of iterations.
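Equation (3) amounts to a linear decay of the energy toward zero; a one-line sketch (the function name is ours):

```python
def escaping_energy(E0, t, T):
    """Equation (3): the escaping energy decays linearly from 2*E0 toward 0
    as the iteration counter t approaches the maximum T."""
    return 2.0 * E0 * (1.0 - t / T)
```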
When |E| ≥ 1, the hawks seek out more areas to investigate the rabbit's whereabouts; otherwise, the exploitation stage begins. The algorithm models the rabbit's chance of escaping with a random number p: a successful escape when p < 0.5 and a failed escape when p ≥ 0.5. Depending on the rabbit's energy, the hawks carry out either a soft siege (|E| ≥ 0.5) or a hard siege (|E| < 0.5). The soft siege is defined as in Equations (4)–(6):

x(t + 1) = ∆x(t) − E |J x_rabbit(t) − x(t)|    (4)
∆x(t) = x_rabbit(t) − x(t)    (5)
J = 2 (1 − r5)    (6)
• ∆x(t) is the difference between the rabbit's position and the hawk's current position, as given in Equation (5).
• J represents the rabbit's random jump strength, where r5 is a random number inside (0, 1).
A hard siege, on the other hand, is calculated as in Equation (7):

x(t + 1) = x_rabbit(t) − E |∆x(t)|    (7)

A soft siege with progressive rapid dives is attempted when |E| ≥ 0.5 and p < 0.5, as the rabbit could still successfully escape. The hawks can select the best dive, and a Lévy flight is employed to imitate the prey's hopping. The hawks' next candidate move k is calculated as shown in Equation (8) to determine whether a simple dive is successful:

k = x_rabbit(t) − E |J x_rabbit(t) − x(t)|    (8)
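The soft- and hard-siege updates of Equations (4)–(7) can be sketched as follows (a minimal illustration; function names are ours, and E denotes the current escaping energy):

```python
import numpy as np

def soft_besiege(x, x_rabbit, E, rng):
    """Equations (4)-(6): the rabbit still has energy (|E| >= 0.5)
    but its escape attempt fails (p >= 0.5)."""
    J = 2.0 * (1.0 - rng.random())            # random jump strength in (0, 2)
    delta = x_rabbit - x                      # Equation (5)
    return delta - E * np.abs(J * x_rabbit - x)   # Equation (4)

def hard_besiege(x, x_rabbit, E):
    """Equation (7): the rabbit is exhausted (|E| < 0.5) and its escape fails."""
    return x_rabbit - E * np.abs(x_rabbit - x)
```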
If the previous dive turns out to be ineffective, the hawks dive following the Lévy flight pattern L, as in Equation (9):

z = k + S × L(dim)    (9)

• S is a random vector of size 1 × dim, where dim is the dimension of the problem.
• L is the Lévy flight function, L(dim) = 0.01 × (u × σ)/|v|^(1/β), where u and v are random values, β is a default constant set to 1.5, and σ = (Γ(1 + β) sin(πβ/2) / (Γ((1 + β)/2) β 2^((β−1)/2)))^(1/β).
Equation (10) is then used to update the position in the soft siege with progressive rapid dives, where k and z are calculated via Equations (8) and (9), respectively:

x(t + 1) = k, if F(k) < F(x(t));  otherwise z, if F(z) < F(x(t))    (10)

where F is the fitness function. A hard siege with progressive rapid dives occurs when |E| < 0.5 and p < 0.5: the rabbit no longer possesses enough energy to flee, but may still escape. Here z is still calculated via Equation (9), while k is updated using Equation (11), which uses the average hawk position:

k = x_rabbit(t) − E |J x_rabbit(t) − x_mean(t)|    (11)
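The progressive rapid dives of Equations (8)–(11) can be sketched as follows (a minimal illustration with our own names, using Mantegna's algorithm for the Lévy step; F is the fitness function, lower is better):

```python
import numpy as np
from math import gamma, sin, pi

def levy(dim, rng, beta=1.5):
    """Levy-flight step (Mantegna's algorithm), as used in HHO's rapid dives."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return 0.01 * u / np.abs(v) ** (1 / beta)

def progressive_dive(x, x_rabbit, x_mean, E, J, F, rng, hard=False):
    """Equations (8)-(11): try dive k, then a Levy-flight dive z; keep
    whichever improves the fitness F, otherwise stay in place."""
    anchor = x_mean if hard else x                     # Equation (11) uses x_mean
    k = x_rabbit - E * np.abs(J * x_rabbit - anchor)   # Equation (8) / (11)
    z = k + rng.random(x.size) * levy(x.size, rng)     # Equation (9)
    if F(k) < F(x):                                    # Equation (10)
        return k
    if F(z) < F(x):
        return z
    return x
```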

Proposed Algorithm
In this section, we describe in detail how the two proposed algorithms work. We combined HHO with SVM and with k-NN to develop two approaches, HHO-SVM and HHO-KNN, that address the high dimensionality of microarray data by finding the most meaningful genes, and we compare the SVM and k-NN classifiers to determine which gives the best accuracy while selecting the fewest genes. The fitness function used is the classification error rate.
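As a sketch of this design (our notation, not the paper's code), each hawk's continuous position can be thresholded into a binary gene mask and scored by the classifier's error rate:

```python
import numpy as np

def fitness(position, X, y, error_rate):
    """Error-rate fitness of a hawk's position: threshold the continuous
    position into a binary gene mask, then evaluate the classifier on the
    reduced data. `error_rate` is any callable returning the error on (X', y)."""
    mask = position > 0.5
    if not mask.any():
        return 1.0          # an empty gene subset gets the worst possible score
    return error_rate(X[:, mask], y)
```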
The steps of both the HHO-SVM and HHO-KNN algorithms are shown in Figure 1. In addition, in Algorithm 1, we present pseudo code for the HHO algorithm.
To evaluate the performance of the two proposed approaches, leave-one-out cross-validation (LOOCV) was used to calculate the accuracy and avoid overfitting for both classifiers. In LOOCV, all samples except one are used as training data, and the remaining sample is used as testing data; this is repeated until every sample has been tested once. The LOOCV accuracy is the average over these N classifications.
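The LOOCV procedure can be sketched as follows (a self-contained illustration with a 1-NN stand-in classifier for brevity; the actual experiments use SVM and k-NN with k = 7):

```python
import numpy as np

def loocv_accuracy(X, y, classify):
    """Leave-one-out cross-validation: each sample is held out once for
    testing while all remaining samples serve as training data; returns
    the mean accuracy over the N resulting classifications."""
    hits = 0
    for i in range(len(X)):
        train = np.ones(len(X), dtype=bool)
        train[i] = False                       # hold out sample i for testing
        hits += classify(X[train], y[train], X[i]) == y[i]
    return hits / len(X)

def nearest_neighbor(X_train, y_train, x):
    """1-NN stand-in classifier: predict the label of the closest sample."""
    d = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(d)]
```

On two well-separated clusters, for example, this reports an accuracy of 1.0.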

Experimental Results and Discussions
This section describes the experimental approach, the gene expression data sets used in the study, and the findings from applying the proposed algorithms to the microarray cancer data sets.

Data Sets
In our study, we used publicly available microarray cancer data sets covering both binary and multiclass classification. The performance and effectiveness of the two algorithms were evaluated on six benchmark microarray data sets. The three binary data sets were colon tumor [8], lung cancer [9], and leukemia3 [10]; the three multiclass data sets were leukemia2 [8], lymphoma [11], and SRBCT [11]. A detailed breakdown of the experimental data sets by samples and classes can be found in Table 1.

Parameter Settings
To determine the most suitable solution, SVM and k-NN classifiers were used. Since k = 7 performed well across all test sets, it was used in the experiments. Two significant factors influence the practicality of a method: the maximum number of iterations (Max_iter) and the problem dimension. These parameters, along with k, the dimensions, UB, and LB, can be found in Table 2.

Results and Analysis
Features are selected to improve classification accuracy while lowering the number of features used. Each data set was processed with the two algorithms over varying numbers of features: we applied the proposed techniques to each cancer data set using 1 to 30 genes. For the evaluation, both HHO-KNN and HHO-SVM were applied to binary and multiclass high-dimensional microarray cancer data sets for gene selection. Two metrics were used in our comparison: classification accuracy and the number of genes selected for cancer classification. The experimental results for all of the cancer data sets follow.
Table 3 shows the best, worst, and average classification accuracy of the HHO-KNN and HHO-SVM algorithms on the colon data set. Interestingly, the highest classification accuracy, 90.32%, was the same whether the k-NN or the SVM classifier was applied; however, SVM reached it with 10 selected genes, fewer than the 16 genes selected for k-NN. Table 4 reports the Leukemia2 results: with the same number of selected genes (11), the SVM classifier is more accurate, at 97.22%. The results of the HHO-SVM and HHO-KNN algorithms on the leukemia3 data set are shown in Table 5. The best classification accuracy was achieved when 25 genes were selected: 90.28% for k-NN versus 84.72% for SVM.
Table 6 presents the best, average, and worst accuracy of the HHO-SVM and HHO-KNN algorithms on the Lung data set. Both the k-NN and SVM classifiers reach the highest accuracy of 100% when 2 or 10 genes are selected. Table 7 shows the best, worst, and average classification accuracy of the HHO-KNN and HHO-SVM algorithms on the Lymphoma data set. Both classifiers achieve 100% accuracy in most cases, but k-NN needed fewer genes to reach 100%: 2 genes for k-NN versus 3 genes for SVM. Table 8 compares the average, best, and worst accuracy of the HHO-SVM and HHO-KNN algorithms on the SRBCT data set. The highest accuracy was reached when 29 genes were selected: 92.77% for SVM and 91.57% for k-NN.

Comparative Evaluations
Comparing and evaluating the performance of HHO-SVM and HHO-KNN against the other bio-inspired metaheuristic gene selection algorithms was an important part of our evaluation. Table 9 shows how our findings compare based on accuracy and the number of genes selected.
As can be seen in Table 9, for the lung and lymphoma data sets, HHO-KNN outperformed the other bio-inspired gene selection algorithms, reaching 100% classification accuracy with fewer selected genes than the other methods. HHO-SVM likewise outperformed the other bio-inspired gene selection algorithms on the lung data set. In addition, as the table shows, HHO-KNN and HHO-SVM performed better than their competitor (BQPSO) on the colon data set.

Table 9. Comparison between the proposed selection methods and previous methods in terms of the number of selected genes and accuracy.

Conclusions
Our study proposes two new feature selection techniques using Harris Hawks Optimization (HHO) combined with support vector machines (SVM) and the k-nearest neighbors (k-NN) algorithm for high-dimensional cancer gene selection and classification. The objective of this study was to devise a new algorithm for solving gene selection problems based on bio-inspired principles. Using HHO-SVM and HHO-KNN on six binary and multiclass high-dimensional cancer microarray data sets, we have shown that in terms of classification accuracy and the number of chosen genes, our two algorithms are better than several other algorithms in finding useful informative genes.
The experimental findings on the gene expression data sets show that 100% accuracy was achieved only for the lung and lymphoma data sets. Additionally, the accuracy obtained across the data sets with both the k-NN and SVM classifiers is greater than 90%. Last but not least, HHO-KNN outperformed HHO-SVM on all data sets except Leukemia3. Having found considerable promise for HHO used on its own, for future work we recommend combining HHO with another wrapper bio-inspired feature-selection methodology to produce a hybrid method that improves accuracy while selecting fewer genes.