1. Introduction
Microarray is a multiplex technology used in molecular biology and medicine that enables biologists to monitor the expression levels of thousands of genes [1]. Many microarray experiments have been designed to investigate the genetic mechanisms of cancer [2] and to support the discovery of new drugs in the pharmaceutical industry [3]. According to the World Health Organization, cancer is among the leading causes of death worldwide, accounting for more than 8 million deaths. Therefore, finding a mechanism to discover the genetic expressions that may lead to an abnormal growth of cells is a first-order task today. To build a microarray, short sequences of genes tagged with fluorescent materials are printed on a glass surface for hybridization [4]. Then, the slide is scanned and goes through various data processing steps, including image data collection, quality control and normalization. The resulting dataset is a two-dimensional array with thousands of columns (genes) and several rows (instances):

$$ D = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} & c_1 \\ x_{21} & x_{22} & \cdots & x_{2n} & c_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} & c_m \end{pmatrix} $$

Every instance (a row in $D$) is described by a row vector $(x_{j1}, x_{j2}, \ldots, x_{jn}, c_j)$ that represents a labeled genetic expression: $x_{ji}$ refers to the expression level of gene $i$, and $c_j \in C$ is the classification for the $j$-th sample. $C$ may represent different types of cancer or a binary label for cancerous and non-cancerous tissue.
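For concreteness, the sketch below shows one way such a dataset might be held in memory; the gene names, expression values and class labels are purely illustrative.

```python
import pandas as pd

# Hypothetical microarray dataset D: rows are samples (instances), columns are
# genes, plus a final column holding the class label c_j for each sample.
D = pd.DataFrame(
    {
        "gene_1": [2.1, 0.4, 3.3, 0.2],
        "gene_2": [1.7, 1.9, 0.1, 2.2],
        "gene_3": [0.3, 0.5, 0.4, 0.6],
        "class":  ["ALL", "AML", "ALL", "AML"],
    }
)

X = D.drop(columns="class").to_numpy()   # m x n expression matrix
c = D["class"].to_numpy()                # class labels c_1, ..., c_m
print(X.shape, list(c))
```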
Analysis of microarray data presents unprecedented opportunities and challenges for data mining in areas such as sample classification and gene selection [5,6]. For sample classification, the microarray matrices serve as training sets for a given classifier, which must find a classification function $\ell$ able to assign a class to an arbitrary sequence of genes with unknown label from $D$. The classification function $\ell$ is built by analysing the relation between the labeled sequences of genes in $D$. The performance of supervised classifiers is often measured in three directions: efficiency, representation complexity and accuracy. Efficiency refers to the time required to learn the classification function $\ell$, while representation complexity often refers to the number of bits used to represent the classification function [7]. One of the most common metrics to measure the accuracy of a supervised classifier is the error rate, defined as:

$$ \epsilon(\ell, D) = \frac{1}{m} \sum_{j=1}^{m} \bar{\delta}\big(\ell(x_{j1}, \ldots, x_{jn}),\, c_j\big) $$

where $m$ is the number of sequences of genes in $D$ and $\bar{\delta}$ is the complement of Kronecker's delta function, which returns 0 if both arguments are equal and 1 otherwise.
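As a quick illustration, the error rate above can be computed directly from predicted and true labels; the helper name below is our own.

```python
import numpy as np

def error_rate(predicted, true):
    """Fraction of samples whose predicted class differs from the true class.

    Implements (1/m) * sum of the complement of Kronecker's delta, which is
    1 when the two labels differ and 0 when they agree.
    """
    predicted = np.asarray(predicted)
    true = np.asarray(true)
    return np.mean(predicted != true)

# Example: 2 of 5 predictions are wrong -> error rate 0.4
print(error_rate(["ALL", "AML", "ALL", "ALL", "AML"],
                 ["ALL", "AML", "AML", "ALL", "ALL"]))
```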
The main obstacle in microarray datasets arises from the fact that the genes greatly outnumber the sample observations. As a popular example, the “Leukemia” dataset contains only 72 observations of the expression levels of 7129 genes [8]. It is clear that, in this extreme scenario, sample classification methods cannot perform well because of the “curse of dimensionality” phenomenon, where excessive features may actually degrade the performance of a classifier if the number of training examples used to build the classifier is relatively small compared to the number of features [7].
Feature selection plays an essential role in microarray data classification, since its main goal is to identify and remove irrelevant and redundant genes that do not contribute to minimizing the error of a given classifier [9]. Basically, the advantages of feature selection include selecting a set of genes $G$ with:

$$ \epsilon(\ell, D_G) \leq \epsilon(\ell, D) $$
where $D_G$ is the result of projecting $G$ over $D$. In addition, when a small number of genes is selected, their biological relationship with the target disease is more easily identified. These “marker” genes thus provide additional scientific understanding of the causes of the disease [6]. Feature selection therefore plays a fundamental role in increasing efficiency and enhancing the comprehensibility of the results.
In gene selection, genes are evaluated based on (i) their individual relevance to the target class, (ii) their redundancy level with respect to other genes, and (iii) how each gene interacts with other genes [10]. The relevance and the redundancy level of a gene are often measured by correlation coefficients such as Pearson's correlation [11], Mutual Information (MI) [12], Symmetrical Uncertainty [13] and others. On the other hand, a gene is said to interact with other genes if, when combined with them, it becomes more relevant [14]. Most of the feature selection algorithms in the literature evaluate features using only one or two of these aspects, rather than all three of them as a whole. This may lead an algorithm to output low-quality solutions, especially when redundant genes and interacting genes are abundant in the problem. In addition, we have detected that most feature selection algorithms in the literature suffer from what we call the integrality problem (to be defined). Roughly speaking, the integrality problem occurs when the relevance of a gene is measured by the average of the correlation of its values with the target class. We will further analyse this problem in Section 3.
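As an illustration of the correlation measures mentioned above, the following is a minimal implementation of Symmetrical Uncertainty for two discrete variables; it is our own helper, not code taken from [13].

```python
import numpy as np

def entropy(x):
    """Shannon entropy (in bits) of a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), in [0, 1]."""
    x, y = np.asarray(x), np.asarray(y)
    joint = entropy([f"{a}|{b}" for a, b in zip(x, y)])
    mutual_info = entropy(x) + entropy(y) - joint
    return 2.0 * mutual_info / (entropy(x) + entropy(y))

# A gene discretized into {low, high} versus a binary class label.
gene = ["low", "low", "high", "high", "high", "low"]
label = ["neg", "neg", "pos", "pos", "pos", "neg"]
print(symmetrical_uncertainty(gene, label))   # 1.0: perfectly correlated
```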
While not losing sight of the fact that microarray cancer datasets are large and abundant in “noisy” genes, the first goal of this paper is to present a new algorithm able to efficiently detect and select relevant, non-redundant and interacting genes in order to improve the accuracy of classification algorithms. To reach this goal:
We first introduce a new feature selection methodology that can avoid the integrality problem.
Second, we present a new, simple algorithm that can detect irrelevant, redundant and interacting genes in an efficient way.
Finally, the new algorithm is compared with five state-of-the-art feature selection algorithms on fourteen microarray datasets, which include leukemia, ovarian, lymphoma, breast and other cancer data.
  3. Materials and Methods
In this section, we introduce a new methodology to create feature selection algorithms that take advantage of feature-value information to avoid the integrality problem mentioned in Section 1. For a better understanding of the integrality problem, consider the dataset depicted in Figure 2.
Note that the Symmetrical Uncertainty of each of the four features in Figure 2 with respect to class C takes a different value. Therefore, most of the feature selection algorithms described in Section 2 will select the feature with the highest Symmetrical Uncertainty as the best feature, and the rest of the features might or might not be selected according to their correlation (redundancy score) with it. However, it is clear that class C is perfectly predictable by three one-antecedent rules when a different, lower-ranked feature is selected.
This problem occurs because some features in Figure 2 have at least one value that is highly correlated with a class label, while their other values are not correlated with the class. Consequently, if the relevance of such a feature (gene) is measured by averaging the prediction power of all its feature values, the feature may be considered irrelevant. We call this phenomenon the integrality problem. Note that by the correlation of a feature value with respect to a class label of C we mean the correlation between the binary feature obtained from the respective feature value and the given class label. In the example of Figure 2, at least one such feature value is maximally correlated with the target class C, even though the feature it belongs to receives a low average score.
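The toy example below (not the dataset of Figure 2, which is not reproduced here) illustrates the integrality problem: a feature whose values are scored on average may look weak even though one of its values predicts a class label perfectly.

```python
import numpy as np

# Toy feature with three values; value "a" appears only in class "c1",
# while "b" and "c" are spread over the remaining classes.
feature = np.array(["a", "a", "b", "c", "b", "c", "b", "c"])
label   = np.array(["c1", "c1", "c2", "c2", "c3", "c3", "c2", "c3"])

# The feature as a whole mixes a perfectly predictive value ("a") with
# uninformative ones, so an averaged relevance score underrates it; the binary
# feature "feature == 'a'" alone is perfectly correlated with class "c1".
value_a = (feature == "a").astype(int)
print(np.all(label[value_a == 1] == "c1"))   # True: P(c1 | feature = a) = 1
print(np.all(label[value_a == 0] != "c1"))   # True: "a" covers all of c1
```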
  3.2. Pavicd: A Probabilistic Rule-Based Algorithm
We now introduce a new algorithm, namely Pavicd (Probabilistic Attribute-Value Integration for Class Distinction), which is based on the aforementioned methodology. In order to develop the algorithm, we take into account three aspects: first, how to deal with non-binary datasets; second, how to build the set of relevant feature values for each class label; and third, how to define functions that measure the relevance, redundancy and interaction score of feature values.
The first step in Pavicd is the preparation of the data. Since the proposed methodology is based on the evaluation of feature values instead of features, dealing with non-binary data can be difficult. To handle non-binary data, Pavicd builds a new space of binary features through the decomposition of each feature into as many new binary features as it has distinct feature values; each binary feature takes the value “1” in the positions where the respective feature value appears in the original feature and the value “0” in the other positions. Note that this conversion is reversible, because the original feature can be recovered through the union of its binary features. With this transformation, a feature is analysed piecemeal, so that the information most useful for predicting a given class label is easily identified.
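A minimal sketch of this decomposition for an already discretized feature; `pandas.get_dummies` performs exactly this reversible expansion into one binary column per feature value, and the feature name and values below are illustrative.

```python
import pandas as pd

# A non-binary (already discretized) feature f with three feature values.
f = pd.Series(["low", "mid", "high", "mid", "low"], name="f")

# Decompose f into one binary feature per feature value: each new column is 1
# where that value appears in the original feature and 0 elsewhere.
binary = pd.get_dummies(f, prefix="f").astype(int)
print(binary)

# The transformation is reversible: the original feature is recovered from the
# positions of the 1s (idxmax picks the single column equal to 1 in each row).
recovered = binary.idxmax(axis=1).str[2:]     # drop the "f_" prefix
print((recovered == f).all())                 # True
```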
The second step is to determine, and store, which feature values among the entire set of feature values are relevant for a given class label. Here, we adopt a very simple approach that consists of selecting the covering or reliable feature values for a given class label. Note that we use two thresholds to fix the lower bound values for the selection of covering and reliable values, respectively.
Definition 1. A feature value is said to be covering with respect to a class label if the conditional probability of observing that value given the class label is the largest among all the class labels in C and is not smaller than the covering threshold.
Definition 1 suggests that a feature value is covering with respect to a class if the conditional probability of the value given that class is the largest one obtained over all the class labels in C. Note that every feature value is covering for at least one class label. Therefore, we use the covering threshold to discriminate between “good” covering values and “bad” covering values for a given class label.
Definition 2. A feature value is said to be reliable with respect to a class label if it occurs frequently among the instances of that class and rarely among the instances of the remaining class labels, with the reliability threshold fixing the required lower bound.
According to Definition 2, a feature value is likely to be reliable for a given class if it occurs many times in that class and almost does not occur in the rest of the class labels. Note that, again, we introduce a second threshold to filter the feature values.
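The sketch below shows one possible formalization of covering and reliable values in terms of simple conditional frequencies; the function names, the threshold names `cover_thr` and `reliable_thr`, and their default values are our own placeholders rather than the exact definitions used by Pavicd.

```python
import numpy as np

def conditional_probs(value_mask, labels, target):
    """Return (P(value | target), P(target | value)) for a binary feature value."""
    value_mask = np.asarray(value_mask, dtype=bool)
    in_class = np.asarray(labels) == target
    p_value_given_class = value_mask[in_class].mean() if in_class.any() else 0.0
    p_class_given_value = in_class[value_mask].mean() if value_mask.any() else 0.0
    return p_value_given_class, p_class_given_value

def is_covering(value_mask, labels, target, cover_thr=0.5):
    """One possible formalization of Definition 1: the value occurs relatively more
    often inside the target class than inside any other class, above a threshold."""
    labels = np.asarray(labels)
    coverage = {c: conditional_probs(value_mask, labels, c)[0] for c in np.unique(labels)}
    return coverage[target] == max(coverage.values()) and coverage[target] >= cover_thr

def is_reliable(value_mask, labels, target, reliable_thr=0.9):
    """One possible formalization of Definition 2: most occurrences of the value
    fall inside the target class (high P(target | value))."""
    return conditional_probs(value_mask, labels, target)[1] >= reliable_thr

labels = ["c1", "c1", "c1", "c2", "c2"]
value  = [1, 1, 1, 0, 0]   # present in every c1 instance and in no c2 instance
print(is_covering(value, labels, "c1"), is_reliable(value, labels, "c1"))  # True True
```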
In the third step, we carry out the Integration Analysis by means of a sequential forward search. The sequential forward search is twofold: first, the best feature value among the relevant values for the class label is identified and included in the current solution; second, the sequential forward search itself is performed. To select the best feature value, we use the following evaluation function:
$$ Q(v, c_i) = P(v \mid c_i) \cdot P(c_i \mid v) \qquad (7) $$
This measure is equal to 1 when the feature value $v$ completely covers the class label $c_i$ and does not occur in any instance with a different class (as is the case for three of the feature values in the example of Figure 2), and it takes the value 0 if the feature value does not occur in any of the instances labelled with class $c_i$. In other words, we may expect the best feature value to be a highly-covering and highly-reliable one. For the sequential forward search, we start with the current solution equal to the feature value that maximizes Equation (7); then, in each iteration, we explore the remaining relevant feature values so that the feature value that maximizes the evaluation of the combined solution is selected, while any feature value whose combination does not improve the current solution is removed from the candidate set and never tested again. Note that, since Pavicd deals with binary features (or feature values), the current solution is also a binary feature, because it is the result of applying either the “AND” or the “OR” operator between two binary features. This is briefly explained below.
To evaluate how good a feature value is with respect to the already selected solution, we use the following set of rules:
Rule 1. If both the candidate feature value and the current solution are covering feature values, then they are combined with the “AND” operator.
Rule 2. If both the candidate feature value and the current solution are reliable feature values, then they are combined with the “OR” operator.
Rule 3. If neither Rule 1 nor Rule 2 holds, then apply the rule (1 or 2) whose combination maximizes Equation (7).
Note that the current solution is treated as a feature value (or a binary feature) because, every time a feature value is “added” to it, the solution is transformed into its conjunction (“AND”) or its disjunction (“OR”) with that value, according to whether Rule 1 or Rule 2 holds, respectively. Algorithm 1 shows the pseudocode of Pavicd.
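A rough sketch of the forward search and the combination rules, assuming the `is_covering`, `is_reliable` and `value_score` helpers from the previous sketches are in scope. The operator assignment (AND for covering values, OR for reliable values) and the greedy acceptance test used for pruning are assumptions based on our reading of the text, not the published Algorithm 1.

```python
import numpy as np

def combine(solution, value, labels, target):
    """Merge a candidate value into the current solution following Rules 1-3.
    The operator assignment is our assumed reading of the rules."""
    both_covering = is_covering(solution, labels, target) and is_covering(value, labels, target)
    both_reliable = is_reliable(solution, labels, target) and is_reliable(value, labels, target)
    if both_covering:
        return solution & value          # Rule 1 (assumed): tighten precision
    if both_reliable:
        return solution | value          # Rule 2 (assumed): extend coverage
    # Rule 3: otherwise keep whichever combination scores better under Eq. (7).
    and_mask, or_mask = solution & value, solution | value
    return max((and_mask, or_mask), key=lambda m: value_score(m, labels, target))

def forward_search(values, labels, target):
    """Greedy sequential forward search over binary feature values for one class
    label. `values` maps feature-value names to binary indicator arrays; the
    pruning condition is simplified to a greedy acceptance test."""
    candidates = {k: np.asarray(v, dtype=bool) for k, v in values.items()}
    best = max(candidates, key=lambda k: value_score(candidates[k], labels, target))
    solution = candidates.pop(best)
    for name in list(candidates):
        merged = combine(solution, candidates.pop(name), labels, target)
        if value_score(merged, labels, target) > value_score(solution, labels, target):
            solution = merged            # keep the candidate's contribution
    return solution
```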
Algorithm 1: Algorithm of Pavicd.