An Experimental Comparison of Feature-Selection and Classiﬁcation Methods for Microarray Datasets

: In the last decade, there has been a growing scientiﬁc interest in the analysis of DNA microarray datasets, which have been widely used in basic and translational cancer research. The application ﬁelds include both the identiﬁcation of oncological subjects, separating them from the healthy ones, and the classiﬁcation of different types of cancer. Since DNA microarray experiments typically generate a very large number of features for a limited number of patients, the classiﬁcation task is very complex and typically requires the application of a feature-selection process to reduce the complexity of the feature space and to identify a subset of distinctive features. In this framework, there are no standard state-of-the-art results generally accepted by the scientiﬁc community and, therefore, it is difﬁcult to decide which approach to use for obtaining satisfactory results in the general case. Based on these considerations, the aim of the present work is to provide a large experimental comparison for evaluating the effect of the feature-selection process applied to different classiﬁcation schemes. For comparison purposes, we considered both ranking-based feature-selection techniques and state-of-the-art feature-selection methods. The experiments provide a broad overview of the results obtainable on standard microarray datasets with different characteristics in terms of both the number of features and the number of patients.


Introduction
In the last decade, there has been a growing scientific interest in the analysis of DNA microarray datasets, which involved different research fields, such as biology, bioinformatics, and machine learning.DNA microarrays contain measures relative to tests for quantifying the types and the amounts of Messenger Ribonucleic Acid (mRNA) transcripts present in a collection of cells, thus allowing researchers to characterize tissue and cell samples regarding gene expression differences.These data have been widely used in basic and translational cancer research.In this context, DNA microarrays were used either to separate healthy patients from oncological ones, based on their "gene expression" profile (two class problems), or to distinguish between different types of cancer (multiclass problems).
DNA microarray experiments typically generate a very large number of features: for each patient, in fact, each feature represents a gene expression coefficient corresponding to the abundance of Messenger Ribonucleic Acid (mRNA) in a concrete sample [1].The effect is that the number of features in the raw data ranges from 6000 to 60,000, while the available datasets usually refer to less than 100 patients [2].This implies that the classification task is very challenging, due to both the high number of gene expression and the small number of samples [3], and require the implementation of an accurate feature-selection process to reduce the complexity of the feature space and to identify a subset of highly distinctive genes [4,5].It has been demonstrated, in fact, that most of the genes measured in DNA microarray experiments are not relevant to improve classification accuracy for many of the classes to be recognized [6].The application of a feature-selection process, however, is not only due to the need for the elimination of redundant or non-distinctive features, but also to more accurately correlate gene expression to diseases.This last aspect is very important for biologists and explains the scientific interest for the development of new features selection algorithms, as evidenced by the large number of works published in the literature in recent years.Nevertheless, it has been highlighted that in this framework there are no standard state-of-the-art results generally accepted by the scientific community: therefore, it is difficult to compare the effectiveness of new algorithms and decide which approach can provide better results in the general case.The problem is even more complex when considering the large number of microarray data sets available whose properties, in terms of both the number of features and the number of patients, can vary significantly [2].
Based on these considerations, the aim of the present work is to provide a broad experimental comparison on the feature-selection and classification techniques applied to different DNA microarray datasets.Unlike other works that have analyzed these aspects separately [2,7], our aim is to study the combined effects of the feature-selection process on the classification results, and to evaluate the sensitivity of the considered classification methods with respect to the size of the feature space.Moreover, taking into account the complexity of the feature-selection problem in case of thousands of features, we focused our analysis on standard feature ranking techniques and we evaluated the classification methods considering in the experiments feature sets of increasing size, obtained by selecting the features according to the order in which they appear in the ranking [8].For the sake of completeness, we have also compared the obtained results with those relative to the use of three state-of-the-art feature-selection methods.
In the experiments, we considered six DNA microarray datasets with different properties in terms of both the number of features and the number of patients.As for the classification process, we chose four classification methods, namely decision trees, random forest, nearest neighbor and multilayer perceptron, thus providing the interested reader with a broad overview of the results obtainable with the most efficient and widely used classification schemes.
The remainder of the paper is organized as follows: Section 2 briefly illustrate the characteristics of the considered DNA microarray datasets, Section 3 discusses the feature-selection methods, while Section 4 presents the experimental results.Some conclusions are eventually left to Section 5.

The Considered DNA Microarray Datasets
We tested the approaches considered for the experimental comparison on the following, publicly available, microarray datasets: Breast, Colon, Leukemia, Lymphoma, Lung and Ovarian.The characteristics of the datasets are summarized in Table 1.They differ in terms of total number of available attributes, number of classes, and number of samples.
Let us denote with N S the number of samples and with N G the number of genes.As anticipated in the Introduction, the samples in each dataset represent expression levels of genes in cancer or normal (healthy) tissues.It is worth remarking that for all the datasets N G >> N S (N G is of the order of thousands, while N S of the hundreds).These datasets are described in the following.

•
Breast cancer typically develops in either the lobules or the ducts of the breast, but it can also occur in the fatty tissue or the fibrous connective tissue within the breast.The samples of the Breast dataset were extracted from frozen tumor tissue samples from patients with lymph-node-negative breast cancer, but who did not receive systemic neo-adjuvant or adjuvant therapy [9].The dataset contains 286 samples, labelled according to the estrogen receptors (ER), with 219 of them below the cut-off value and the remaining ones above this value.As concerns the number of genes, each sample is represented by 17,816 genes.

•
Colon tumor is a disease in which cancerous growths (tumors) are found in the tissues of the colon.This dataset contains 62 samples.Among them, 40 tumor biopsies are from colon adenocarcinoma specimens (snap-frozen in liquid nitrogen within 20 min of removal), whereas 22 normal biopsies are from healthy parts of the colons of the same patients.The total number of genes to be tested is 2000.The samples were analyzed with an Affymetrix oligonucleotide array [10].

•
Leukemia is a primary disorder of bone marrow.They are malignant neoplasms of hematopoietic stem cells.The Leukemia dataset contains 72 samples, each represented by 7129 genes.All samples represent acute leukemia patients, either acute lymphoblastic leukemia (ALL) or acute myelogenous leukemia (AML) [6].

•
The term lymphoma encompasses a broad variety of cancers of the lymphatic system, and the diffuse large B-cell lymphoma (DLBCL) is the most common subtype of non-Hodgkin's lymphoma.The Lymphoma dataset contains samples representing DNA microarrays of gene expression in B-cell malignancies [11].The total number of available genes is 4026 and the number of samples is 62.As concerns the number classes, the samples belong to nine different classes, each representing a different stage of the disease.The DNA microarray analysis of gene expression was done as described in [12].

•
Ovarian cancer is caused the by the uncontrollable growth of the cells in the ovaries.This growth produces a lump of tissue.It is one of the most common types of cancer in women and mainly affects those over the age of 50.Moreover, ovarian cancer symptoms are similar to those of some more common conditions, and this make it not always easy to diagnose.The Ovarian dataset contains 216 samples (100 patients and 116 controls).In this dataset, all major epithelial subtypes of ovarian cancer are represented [14].

Feature Selection
Feature selection is the process of reducing the dimensionality of the available data, with the aim of improving the recognition results.This process typically consists of three steps: a search procedure for searching the solution space made of all the possible solutions, i.e., feature subsets, an evaluation function, and a stopping criterion.
The main difficulty of feature selection is due to the dimension of the search space, which grows exponentially with the number of available features: for data represented by N features, there are 2 N feasible solutions.This makes the exhaustive search strategy computationally intractable [15].This is typically the case of microarray datasets in which, as previously evidenced, N is of the order of thousands.To overcome this problem, heuristic algorithms are typically used for finding near-optimal solutions.As for the evaluation functions, they are generally subdivided into three wide categories, namely filter, wrapper and embedded methods [16,17].Filter methods measure statistical or geometrical properties of the subset to be evaluated, whereas wrapper functions adopt as evaluation measure the accuracy achieved by a given, previously chosen, classifier.Finally, embedded approaches include feature selection in the training process, thus reducing the computational costs due to the classification process needed for each subset.
As just mentioned, wrapper methods are computationally costly because the evaluation of each subset requires the training of the adopted classifier.For this reason, they are typically used with near-optimal search strategies, which can achieve acceptable results, but limiting the computational costs.As for the filter methods, they need non-iterative computations on the dataset which are, in most of the cases, significantly faster than classifier training sessions.Moreover, filter approaches, estimating intrinsic properties of the data, provide more general solutions, typically performing well on a wide family of classifiers.
A particular category of filter methods is that of the ranking ones, which evaluate each feature singularly.Once all features have been evaluated, they are ranked according to their merit.Then the subset search step is straightforward: the best M features are selected, with M set by the user.Unfortunately, even if this approach is very fast and allow dealing with thousands of features, there is no one general criterion for choosing the dimension of the feature space, then it is difficult to select the number M of features to be selected.Moreover, most importantly, relevant features that are highly informative when combined with other ones could be discarded because they are weakly correlated with the target class.
In this study, because of the huge dimensionality of the datasets used for the experiments, we have considered only standard ranking algorithms to build up an experimental protocol for selecting the feature subsets allowing us to achieve the best classification results.The adopted protocol did not use any search strategy, but adds the features progressively, according to their position in the ranking.As for the ranking, we have considered five standard univariate measures, namely Chi-square, Relief, Gain Ratio, Information Gain, and Symmetrical Uncertainty.These measures, as well as the state-of-the-art feature-selection approaches considered for the comparison, are detailed in the following subsections.

Chi-Square
The Chi-Square (CS) approach implements a discretization algorithm based on the CS statistic [18].For each feature, the values present in available data are first sorted, and each value represent an interval.Then the Chi-square statistic is used to assess the class relative frequencies of adjacent intervals.For each couple of adjacent intervals, if their frequency similarities are above a given threshold, they are merged.Class similarities are computed according to the following formula [18]: where C is the number of classes, A ij is the number of instances of the j-th class in the i-th interval and E ij is the expected frequency of A ij given by the formula: where R i is the cardinality of the i-th interval and C j and NT represent the number of samples of the j-th class and the total number of samples, respectively, in the two adjacent intervals.The stopping criterion of the merging process is based on the choice of a threshold, which represents the maximum value of the frequency differences in adjacent intervals.In the experiments detailed in Section 4, this value was set according to the results of some preliminary experiments.

Relief
The Relief (Re) measure uses an instance-based learning approach to assign a weight to each feature, representing its relevance with respect to the target concept [19].This weight is computed by finding for each sample the nearest neighbor of the same class (nearest hit) and the nearest neighbor of a different class (nearest miss).A given feature receives a high weight if it takes different values for instances from different classes and similar values for instances belonging to the same class.Given a feature X, whose values belong to the set of discrete values {x 1 , x 2 , . . ., x n }, its weight is computed according to the following formula [19]: where p(x i ) is the probability that the feature X assumes the value x i , S C is the set of classes to be discriminated, p(c k ) is the a-priori probability of the k-th class and I G is a modified version of the Gini index [19], given by the following formula:

Information Gain, Gain Ratio Gain and Symmetrical Uncertainty
The measures detailed in this subsection are based on the well-known information-theory concept of entropy.Given a discrete variable X, which can take the values {x 1 , x 2 , . . ., x n }, its entropy H(X) is defined as: where p(x i ) is the probability mass function of the value x i .In practice, H(X) estimates the uncertainty of the random variable X.
The conditional entropy of two discrete random variables X and Y, taking the values {x 1 , x 2 , . . ., x n } and {y 1 , y 2 , . . ., y n } respectively, is defined as: where p(x i , y j ) is the joint probability that X = x i and Y = y j .In this case, H(X|Y) represents the amount of randomness in the random variable X when the value of Y is known.The concepts of entropy and conditional entropy just defined can be used to evaluate the effectiveness of a given feature in predicting the class of unknown samples.The first of the three information-theory-based measures considered in this study is denoted as Information Gain (IG) [20].For a given a feature X, IG(X) is computed in terms of both the class entropy H(C) and the conditional entropy H(C|X).More specifically, such quantities are used to compute IG(X) as follows: It is worth noting that IG(X) measures the reduction of the randomness of the variable C when X is known.In practice, IG(X) measures the quantity of information about C provided by the feature X.
Finally, the Gain Ratio and Symmetrical Uncertainty measures are based on the just mentioned IG measure.More specifically, Gain Ratio (GR) is computed according to the following formula [21]: As for the Symmetrical Uncertainty (SU), instead, it has been defined in such a way to compensate the bias toward the attributes taking more values.This is achieved by normalizing its value to the range [0, 1] [20]:

Heuristic Search Algorithms
For the sake of comparison, we have also considered three state-of-the-art heuristic search algorithms, namely the Sequential Forward Floating Search, the Fast Correlation-Based Filter and the Minimum Redundancy Maximum Relevance, whose properties are briefly summarized in the following.

•
Sequential Forward Floating Search.This feature-selection algorithm is based on a greedy hill-climbing iterative strategy.Given a feature subset evaluation function f (S), it starts with the empty set S 0 and, at the i-th step, selects among the available features (i.e., those not selected yet) the feature X b that produces the highest increment of f (S i−1 ∪ X b ).Once X b has been selected, the algorithm also checks if f (S i ) can be increased excluding from S i one of the features previously selected.
It is worth noting that we used an improved version of this algorithm, presented in [22].It will be denoted as SFFS in the following.

•
Fast Correlation-Based Filter.This algorithm searches for subsets containing the features that are most relevant to the target concept and least redundant each other.This search is driven by an evaluation function that is based on the concept of mutual information I. Given two random variables X and Y, the mutual information I(X, Y) quantifies the amount of information obtained about one random variable through observing the other random variable.The mutual information is first used to sort the available features, in decreasing order, according to mutual information with the class.Once the features have been ranked, the final subset is built by iteratively selecting the top-ranked features, but skipping the redundant ones, i.e., those that have a high mutual information level with one of the previously selected features [23].It will be denoted as FCBF in the following.

•
Minimum Redundancy Maximum Relevance.This approach searches for subsets that maximize the relevance with the target class and minimize the redundancy among the selected features.Given a feature subset S, its relevance is computed by averaging the mutual information I(x i , C), with x i ∈ S and C representing a class.As for the redundancy of S, it is computed by averaging I(x i , x j ), with x i , x j ∈ S. It is worth noting that the mutual information can be computed for both discrete and continuous features.In case of continuous features, this quantity is obtained by using the Parzen windows method to estimate feature probability densities.Further details of the algorithm can be found in [24].It will be denoted as MRMR in the following.

Experimental Results
As anticipated in the Introduction, the effectiveness of the feature subsets obtained by applying the selection techniques described in Section 3, has been evaluated by using four different classification schemes, namely decision trees, random forest, nearest neighbor, and multilayer perceptron.For each of them, we performed a set of experiments using the microarray datasets described in Section 2, following the same experimental protocol.The adopted classification schemes, the experimental protocol, and the sets of experiments are described in the following subsections.

The Classification Schemes
As for the classifiers considered in this study, we used the implementation provided by the WEKA tool [25] choosing a 10-fold cross validation strategy.Furthermore, to manage the randomness of classification algorithms, we performed 20 runs for each experiment.All the results reported below were obtained by averaging the values over the 20 runs.
A decision tree (DT) is a classification method that uses a knowledge representation based on a tree structure.In such structure, the internal nodes represent specific tests on the features used to represent patterns, while the branches coming out of a node represent the possible results of these tests.Finally, the leaf nodes represent the class assigned to an input sample based on the results of all the tests performed along a path from the root node to the leaf node.In this study, we considered C4.5 learning algorithm, which builds the decision tree with a top-down approach, using the concept of information entropy.In practice, given a training set TS, it breaks down TS into smaller subsets by gradually increasing the depth of the tree, until it reaches a leaf node in which the process terminates.In each node of the tree, C4.5 chooses the feature that most effectively allows the splitting of the corresponding sample subsets, using the normalized entropy gain as a splitting criterion: this criterion measures how much the obtained subsets are homogeneous, in terms of class labels.
The term random forest (RF) does not refer to a single algorithm, but rather to a family of methods for creating an ensemble of tree-based classifiers.The original algorithm proposed by Breiman in [26] is usually denoted Forest-RI in the literature and it is considered a reference method in most of the papers dealing with RF.Given a training set that contains N f feature vectors, each consisting of N features, the forest-RI algorithm is composed of the following steps applied to each tree: 1.
Randomly extract N f samples with replacement from the dataset.The obtained set of samples will be used as training set for the starting node of the tree.

2.
Set a number K N.

3.
At each node, randomly select K features from the whole set of available ones.

4.
For each of the selected features, consider its values in the training set and choose the best binary split value according to the Gini index [27].Then, choose the feature with the best index value and generate two new nodes by splitting the samples associated with the original node according to such a value.

5.
Increase the depth of the tree to its maximum size, according to the defined stopping criterion.Note that node splitting usually is stopped when one of the following conditions occur: (i) The number of samples in the node to be split is below a given threshold; (ii) all the samples in the node belong to the same class.6.
Leave the tree not pruned.
Once the forest has been built, an unknown sample is labeled according to the Majority Vote rule: i.e., it is labeled with the most popular class among those provided by the ensemble trees.
The K-Nearest Neighbor algorithm (K-NN) is a well-known non-parametric classification technique, according to which an unknown sample is assigned the most frequent class among those of its k-nearest samples of the training set.The rationale behind this technique is that given an unknown sample x to be assigned to one of the c i classes of the considered application field, the a-posteriori probabilities p(c i x) in the neighborhood of x may be estimated by taking into account the class labels of the k-nearest neighbors of x.Although its simplicity, K-NN proved to be able to provide high performance.The results shown below were obtained by using the Mahalanobis distance, which demonstrated to be more effective than the Euclidean distance in a set of preliminary experiments.As regards the value of the parameter k, we found that the value k = 1 significantly outperformed the higher values.This result suggests that for these kinds of data, higher value of k force the algorithm to consider far neighbors belonging to classes different from the actual class of the sample at hand.The effect is that of assigning several samples to a wrong class, reducing the overall recognition rate.This behavior could also be a consequence of the limited number of specimens available in the training set, as well as a specificity of the considered DNA microarray datasets.
An artificial Neural Network (NN in the following) is a well-known information processing paradigm inspired by the way biological nervous systems process information.These networks typically consist of many highly interconnected processing elements (neurons), often equipped with a local memory, that can process information locally, providing as output a single unidirectional signal transmitted to all the connected neurons.A NN can be configured to solve specific problems, such as pattern recognition or data classification ones, through a learning process, which implies the adjustment of the information locally stored in each neuron.In this study, we considered the multilayer perceptron network with a feed-forward completely connected three-layer architecture.The training was performed by using the back-propagation algorithm, which is one the most popular training method, successfully applied to a large variety of real-world classification tasks.

The Experimental Protocol
To illustrate the experimental protocol followed in all the experiments related to the use of feature ranking techniques, let us consider the first DNA microarray dataset, namely Breast, and the ranking provided by the first univariate measure, namely CS: by using this feature ranking technique, we generated different representations for such dataset, each containing an increasing number of features.More specifically, we generated 10 representations in the following way: in the first one, the samples were represented by using the first n 1 features in the ranking, in the second one by using the first n 2 features, in the third one the first n 3 features and so on.The considered feature numbers are the following: 5, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10,000.The same procedure has been repeated for the other univariate measures taken into account.The whole process has been applied to the other DNA microarray datasets considered in this study.Summarizing, for each microarray dataset, we produced 5 different feature rankings, each used to generate 10 additional representations, totaling 50 different representations.Each of them has been used in the experiments for evaluating the obtainable classification results, applying the previously described classification schemes.
As for the experimental protocol relative to the use of the heuristic search algorithms, both Sequential Forward Floating Search and Fast Correlation-Based Filter produced an additional representation of each microarray dataset, in which the samples were represented by using only the features respectively selected.Thus, such algorithms allowed us to generate 2 additional representation of each dataset.On the contrary, Minimum Redundancy Maximum Relevance method requires setting a-priori the number N of feature to be selected and then it finds the best feature subset with such a cardinality.Thus, to effectively compare the results of this method with those relative to the other ones, we decided to consider for N the same set of values used in the experiments for the ranking-based feature-selection approaches.Moreover, for the sake of comparison, we have also considered, for each microarray dataset, the values of N respectively provided by the other two heuristic search algorithms.

Experimental Findings
For the sake of conciseness, we reported in the tables only the best results obtained in each experiment.Please note that in all the tables we indicated with RR and NF the recognition rate and the number of features, respectively, and we put in bold the best result obtained for each dataset.Tables 2 and 3 show the results obtained by using the decision tree classifier: the first table refers to ranking-based feature-selection techniques, while the second one to heuristic search algorithms.For comparison purposes, the last column of both tables indicates the recognition rate relative to the use of the whole feature set.Similarly, Tables 4 and 5 show the results obtained by using the random forest classifier, Tables 6 and 7 those obtained by using the K-Nearest Neighbor classifier, while Tables 8  and 9 report the results relative to the Neural Network classifier.Note that the Neural Network classifier has not always been able to complete the training phase when all the features have been used.In fact, in the case of Breast, Leukemia and Lung datasets, the back-propagation algorithm did not produce significant results due to the enormous complexity of the feature space.This situation has been evidenced in Tables 8 and 9 inserting an asterisk instead of the result.
The analysis of the results reported in the tables suggests the following considerations: • there is not a single classification scheme or a single feature-selection method that outperforms all the others but, for each microarray dataset, the effectiveness of the whole classification system is due to the combined effect of both feature-selection method and classification scheme; • for each classification scheme, the best results obtained by using ranking methods are generally comparable with those relative to the use of heuristic search, even if there are in many cases significant differences among the results of the ranking methods as well as among those of the heuristic search ones; • the number of selected features is generally much less than the total number of available features, while the classification results are always significantly better than those corresponding to the use of the whole feature set: this confirms that for the analysis of DNA microarray datasets, the application of features selection techniques not only reduce the complexity of the feature space, but also allow users to significantly improve classification performance; • accepting slight variations in the recognition rate (less than 2%), it is possible to obtain very significant reductions in the number of selected features and, therefore, in the computational cost of the classification system.Consider, for instance, the Ovarian dataset for which the best result, equal to 96.85% with 500 features, was obtained by using the feature ranking method based on GR measure with a neural classifier.In the same experimental condition, but selecting only 200 features, the recognition rate was 96.01%.Thus, a reduction of 60% in the dimension of the feature space resulted in a reduction of less than 1% in the recognition rate.We obtained similar results in most of the experiments performed in this study.See, for instance, the results on the Brest dataset with RF classifier and ranking-based feature-selection methods (Table 4): the best results, equal to 89.70% with 200 features, was obtained with CS method, while GR method provided a very similar recognition rate, equal to 89.44% but using only 50 features.Please note that many of these results are not shown in the tables since, for each set of experiments, we reported only the one providing the best recognition rate.
Finally, Table 10 summarizes the best result obtained for each microarray dataset by applying the feature-selection techniques considered in this study.For comparison purposes, the table also shows the best results obtained by using the whole feature set.It is interesting to note that in all the experiments the random forest classifier provided the best results without feature selection: this outcome is in good accordance with the theory, since it is known that such classifier performs very well when the feature number is very high.The data in Table 10 also show that the best performance with feature selection was provided by KNN and NN classifiers.It should be noted, however, that the best results provided by the other two classification schemes, namely DT and RF, are slightly lower, but obtained by selecting in the average a smaller number of features.

Conclusions and Future Works
In this study, we performed a wide experimental comparison on the combined effect of both feature-selection and classification techniques applied to different DNA microarray datasets, paying particular attention to evaluate how much the number of selected features can affect the obtainable classification performance.
Since DNA microarray experiments typically generate a very large number of features (in the order of thousands) for a limited number of patients (in the order of hundreds or less), the classification task is very complex and typically require the application of a feature-selection process to reduce the complexity of the feature space and to identify a subset of distinctive features.
In our experiments, we considered six microarray datasets, namely Breast, Colon, Leukemia, Lymphoma, Lung and Ovarian, and four classification schemes, namely decision trees, random forest, nearest neighbor, and multilayer perceptron.Such datasets represent an effective test bed, since they exhibit a large variability as regards the number of attributes, the number of classes (two or multiple classes problems) and the number of samples.The considered classifiers provide a broad overview of the results obtainable with the most efficient and widely used classification schemes.
As for the feature-selection process, taking into account the complexity of this problem in the case of thousands of features, we focused our analysis on standard feature ranking techniques, which evaluate each feature singularly, estimating intrinsic properties of the data.These techniques typically provide general solutions, performing well on a wide family of classifiers, even if they could discard relevant features that are weakly correlated with the target class, but highly informative when combined with other features.The performance of the classification methods was evaluated by using feature sets of increasing size, obtained by selecting the features according to their position in the ranking.Despite this simplified choice, we obtained very interesting results even in comparison with those produced by the other three state-of-the-art feature-selection methods considered in this study, namely the Sequential Forward Floating Search, the Fast Correlation-Based Filter and the Minimum Redundancy Maximum Relevance.
The results confirmed that the feature-selection process plays a key role in this application field and that a reduced feature set allowed us to significantly improve classification performance.They also showed that by accepting a slight reduction of the classification rate, it is possible to further reduce the feature set, significantly improving the computational efficiency of the whole classification system.
As a future work, we would like to improve the classification system by using better performing strategies, such as those based on classifier ensembles [28][29][30].The use of Learning Vector Quantization networks could also provide interesting results, as they are generally considered a powerful pattern recognition tool and give the advantage of providing an explicit representation of the prototypes of each class [31] (in this case, a prototype represents the specific gene expression values characterizing tissue and cell samples).

Table 1 .
The datasets considered for the experiments.

Table 2 .
Best results of ranking-based feature-selection methods with DT classifier.

Table 3 .
Best results of heuristic search algorithms for feature selection with DT classifier.

Table 4 .
Best results of ranking-based feature-selection methods with RF classifier.

Table 5 .
Best results of heuristic search algorithms for feature selection with RF classifier.

Table 6 .
Best results of ranking-based feature-selection methods with kNN classifier.

Table 7 .
Best results of heuristic search algorithms for feature selection with kNN classifier.

Table 8 .
Best results of ranking-based feature-selection methods with NN classifier.

Table 9 .
Best results of heuristic search algorithms for feature selection with NN classifier.

Table 10 .
Summary of the best results.