Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique

(1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods focused on how to estimate MVs have been proposed in the past few years. Recent studies show that those imputation algorithms make little difference to classification. Thus, some scholars believe that selecting informative genes for downstream classification is more important than imputing MVs. However, most feature-selection (FS) algorithms require imputation beforehand, and the impact of beforehand MV imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS is introduced for gene-expression data. To deal with the challenge of the small sample size of gene-expression data, a heuristic method called recursive element aggregation is proposed in this study. Our approach can directly handle incomplete data without any imputation methods or missing-data assumptions. The most informative genes can be selected through a threshold. After that, the best-first search strategy is utilized to find optimal feature subsets for classification. (3) Results: We compared our method with several FS algorithms. Evaluation was performed on twelve original incomplete cancer gene-expression datasets. We demonstrate that MV imputation on an incomplete dataset impacts subsequent FS in terms of classification tasks. By conducting FS directly on incomplete data, our method avoids the potential disturbance of subsequent FS procedures caused by MV imputation. An experiment on the small, round blue cell tumor (SRBCT) dataset showed that our method found additional genes besides the many genes it shares with two existing methods.


Introduction
As an important technology in the field of bioinformatics, microarray technology is prominent due to its ability to simultaneously measure thousands of gene-expression levels [1,2]. Gene-expression data obtained from microarray experiments are usually confronted with high-dimension and missing-data problems [3,4]. These characteristics generate two problems for downstream gene-expression data analysis (e.g., classification). The first is that data obtained from microarray technology often contain missing values (MVs). MVs present a challenge to traditional analysis models that require a complete data matrix [5,6]. The other problem is the high computational complexity caused by the data's high dimensionality [7,8].
Numerous methods have been proposed to solve the above two problems in the past few years. MV imputation is the mainstream approach: it replaces missing entries with an estimated value or a given fixed value [9][10][11][12][13]. Typical representative methods include least-square adaptive (LSA) [9], local least squares (LLS) [10], Bayesian principal component analysis (BPCA) [11], k nearest neighbor (KNN) [12], and partial least square (PLS) [13]. Most studies applied root mean squared error (RMSE) or its variants as the evaluation criterion for the performance of MV imputation algorithms.
In the last decade, several papers focused on comparing the impact of different MV imputation methods on downstream statistical analyses [2,[14][15][16][17][18][19][20][21]. Those papers typically ran a number of typical methods on several datasets to evaluate the biological impact on downstream analyses, such as biomarker detection, gene-clustering analysis, and classification. In Reference [2], the authors found that, among those three statistical analyses, classification was the least sensitive to the choice of imputation method. Other papers also found that some imputation algorithms may be competitive on the selected comparison datasets, but no algorithm is uniformly superior on all datasets [15][16][17][18][19][20]. In Reference [21], the authors applied several MV imputation algorithms to real incomplete cancer gene-expression data to evaluate the performance of each method, and they drew the same conclusion: imputation methods have a minor impact on classification.
With respect to the computational complexity caused by the high dimension of gene-expression data, feature selection (FS) is the most common and effective technique for dealing with this challenge. FS aims at selecting informative genes from thousands of genes and uncovers the most relevant biological variables. Many FS algorithms have been proposed in the past few years [22][23][24][25][26][27][28][29][30], and they can roughly be divided into four groups [31]: filter, wrapper, embedded, and ensemble strategies. As indicated in Reference [31], many algorithms share common elements and merely differ from each other in details. Moreover, most FS algorithms for gene-expression data require a complete data matrix as the input, while FS algorithms designed for incomplete data are rare.
As gene-expression data often contain MVs, common practice is to apply MV imputation before conducting FS. One may worry whether MV imputation affects the FS result. For example, in the preprocessing phase, features with more than 50% missing data are often directly discarded. It is worth noting that a feature (gene) may still have a greater degree of relevance (discriminative power) than some complete features even though it has more than 50% MVs. Simply removing those incomplete features based on missing rates may therefore be inappropriate in some cases. MV imputation methods also usually rely on assumptions about the data distribution or the missing mechanism, such as the missing-at-random (MAR) assumption [32]. Unfortunately, no technique exists to verify the MAR assumption [33]. Thus, one problem that needs to be considered is whether downstream FS is influenced when the assumption is violated.
Our motivation was to introduce an FS algorithm that can handle incomplete gene-expression data without previous MV imputation, thus avoiding the potential impact of MV imputation on subsequent FS results. We first introduced a modified chi-square test-based FS technique to select informative features from incomplete data. To meet the challenge of the small sample sizes characteristic of cancer gene-expression data, a heuristic method called recursive element aggregation was designed. After FS, we applied MV imputation to the selected feature subset to generate a complete data matrix that can be processed by traditional machine-learning techniques, and a wrapper methodology with a forward best-first search strategy was utilized to identify the best top percentage of genes on the selected feature subset. Moreover, biological inference was conducted on the small, round blue cell tumor (SRBCT) dataset to validate the effectiveness of our method.

Feature-Selection Threshold for MCFS
The MCFS threshold was uniformly evaluated through an experiment in this study. To determine the value of α, the parameter was gradually decreased from 0.01 to 0.00001. For each value of α, we first imputed the MVs of the selected feature subset with the mean of the observed values of each gene, and then we conducted V-ELM with tenfold cross-validation on the complete feature subset. Finally, the V-ELM performance under each value of α was used to determine the threshold. Table 1 gives the number of selected genes (#.genes) under the various thresholds. When α = 0.00001, the dataset risinger only had one feature, so α should be bigger than 0.00001. As we can see, the number of features declined as α decreased. The feature subset selected by MCFS with a smaller α was included in the feature subset selected with a bigger α.

Table 1. #.genes under various thresholds α.

Dataset       α = 0.01   0.005   0.001   0.0005   0.0001   0.00001
alizadeh-v2       1811    1534    1077      924      589       306
alizadeh-v3       1771    1535    1049      891      560       262
bredel            2904    2324     809      551      215        63
chen              5759    5007    3714     3276     2499      1700
garber            2125    1494     705      512      231        75
lapointe-v1       3834    2875    1238      875      349        85
lapointe-v2       7615    6113    3696     3012     1895       957
liang             2349    1302     717      623       17         3
risinger           681     419     114       61       16         1
tomlins-v1        4699    3874    2460     1976     1284       570
tomlins-v2        2650    2874    1678     1320      745       335

Prediction accuracy corresponding to the other five values of α is reported in Table 2. Generally speaking, as α declined from 0.01 to 0.0005, V-ELM accuracy gradually increased. This phenomenon illustrates that more features were removed with the decrease of α, while accuracy increased. It is worth noting that algorithm accuracy increased on most of the datasets (nine of 12), while the performance on garber, liang, and risinger declined, especially on liang (about a 3% decline) and risinger (about a 6% decline). This means that some important genes related to classification were removed when α = 0.0001.
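The per-gene mean imputation used above can be sketched in a few lines of numpy (a minimal illustration of the step, not the authors' implementation; the function name is ours):

```python
import numpy as np

def mean_impute(X):
    """Replace each missing entry (NaN) with the mean of the observed
    values of the same gene (column), as done before running V-ELM."""
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)    # mean over observed values only
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X
```

Any gene with at least one observed value is thus completed without touching the observed entries.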
Considering the characteristic of MCFS, the feature subset selected by a smaller α is included in the feature subset selected by a bigger α. The genes selected with α = 0.0005 were thus also included in the gene subset selected with α = 0.001. Synthesizing the above results, a reasonable value of α lies in [0.0005, 0.001], and we set α = 0.001 in this work. Another parameter that needed to be determined was the k value of the FBFS method. In FB-MCFS, when algorithm performance stops increasing, the next k rounds are still evaluated to test whether the current performance is the best. Figure 1 gives the relationship between the value of k and performance. Similarly, mean-value imputation was applied to the feature subset selected by MCFS; then, FB-MCFS was conducted on the complete matrix. As depicted in Figure 1, algorithm performance is insensitive to the value of k on several datasets (six of 12). For the other six datasets, ali-v3, bredel, chen, liang, lapointe-1, and tomlins-1, when k = 3, algorithm performance was nearly unchanged with the increase of k. The relationship between performance and the value of k was also examined with KNN imputation, and we found the same phenomenon in that experiment. Detailed results of the experiment are in the Supplementary Material. Thus, we suggest setting k between [3,6], and we set 5 as the FB-MCFS parameter in this paper.

MCFS with or without Preimputation
To validate the effectiveness of MCFS in avoiding the potential disturbance of subsequent classification performance caused by MV preimputation, a comparison experiment was conducted. For each algorithm, the average results of 20 trials of tenfold cross-validation experiments are reported in Table 3 (best performance in bold). The last row gives the average over all 12 datasets. MCFS1 represents the method of first conducting MCFS and then conducting MV imputation on the selected feature subset. MCFS2 represents the method of first conducting MV imputation on the original incomplete gene-expression dataset and then conducting MCFS on the obtained complete dataset.


Comparison of Algorithm Stability under Three Imputation Methods
Recent research shows that MV imputation has a minor impact on classification. Here, we aimed to study whether MV imputation affects the subsequent FS and the corresponding classification performance. To this end, we adopted three imputation methods: BPCA [11], KNN [12], and MEAN. For each of the data subsets selected by an FS method with one imputation technique, 20-trial tenfold cross-validation of V-ELM was conducted to give an average result. Table 4 gives the results (accuracy) of the FS algorithms under the various MV imputation methods. Performance with more than 0.02 (2%) difference under the three imputation methods is reported in bold. One can see that MCFS is much more stable; in other words, the four comparison FS algorithms are sensitive to MV imputation. The differences in MCFS accuracy are all smaller than 0.02 on all twelve datasets, while NCA, PCA, UFF, and ReliefF had differences bigger than 0.02 on six, five, five, and four datasets, respectively. It is worth noting that NCA is especially sensitive on the bredel dataset (with a 17 percent difference between KNN and the other two imputation methods). Moreover, UFF was unavailable on the bredel dataset because no features were selected by UFF. UFF was also sensitive on datasets tom1 and tom2. In general, MCFS is more stable than the compared FS algorithms under the three imputation methods. In other words, beforehand MV imputation affects the subsequent FS methods, which need a complete matrix.

Comparison of Gene-Classification Analyses by MCFS and Other Methods
Classification of FB-MCFS was compared with several algorithms under three different imputation methods, respectively. Our objective was to validate the effectiveness of the best-first search strategy. Thus, for convenience, we sequentially increased the feature subset by a given gene percentage for all the datasets. It is worth noting that the liang dataset only had 37 samples; when we applied tenfold cross-validation, there were only three test samples, and they sometimes had the same class label, which made AUC calculation impossible. Thus, here we applied fivefold cross-validation, and several evaluation metrics were reported. A one-versus-rest strategy was also applied in this study. Figure 2 gives the results on balanced accuracy. Detailed results of the experiments on the evaluation metrics in this study, and the common genes selected by the FS methods NCA, UFF, ReliefF, and MCFS on dataset alizadeh-v3, appear in the Supplementary Material.

As shown in Figure 2, compared with MCFS, FB-MCFS improved balanced accuracy on almost all of the datasets. Compared with NCA, PCA, UFF, and ReliefF, FB-MCFS had the best performance on five of 12, five of 12, and four of 12 datasets under the three imputation methods, respectively. Because of the unstable performance of the FS algorithms that need beforehand MV imputation, algorithm performance did not give consistent results. For example, on the bredel dataset, NCA performance was better than FB-MCFS with BPCA and MEAN imputation; however, NCA had a significant gap in balanced accuracy compared with FB-MCFS under KNN imputation. In general, the experiment results demonstrated the effectiveness of FB-MCFS and the stability of MCFS, which filters features without beforehand MV imputation. Algorithm performance was statistically compared with the Friedman statistical test for multiple comparisons between FB-MCFS and the other five algorithms according to the procedures described in Reference [34]. Table 5 gives the results; p-values below 0.05 are reported in bold.

Gene and Pathway Analyses for the SRBCT Dataset by MCFS
In this section, a study on the SRBCT dataset (from Reference [35]) is outlined to validate whether the proposed feature-evaluation criterion is biologically meaningful.

Selecting Most Relevant Genes with MCFS
To select the most relevant genes in SRBCT, MCFS was used to select 231 genes with a 0.0001 threshold. After that, to get the most accurate selection of relevant genes, the genes were ranked based on p-value. Finally, genes were incrementally added to a feature subset, and 100 trials of V-ELM were conducted to evaluate each feature subset in terms of the average accuracy of the 100 V-ELMs. Figure 3 reports the top #.genes versus the corresponding average accuracy. In total, 103 genes were selected by our method. Figure 4 reports the p-values of the 103 genes; a smaller p-value means a greater relevance degree (higher ranking). Among the 103 genes, 45 and 27 genes were common with the genes selected in References [36] and [37], respectively. The top 30 genes are given in Table 6. Among the top 30 genes, 18 and 10 genes were reported in References [36] and [37], respectively. Details about the 103 genes can be seen in the Supplementary Material.


Function Analysis of the Selected Genes
The small, round blue cell tumors (SRBCTs) tend to occur in childhood. SRBCTs include neuroblastoma, non-Hodgkin lymphoma, rhabdomyosarcoma, and the Ewing family of tumors, and they have a similar appearance on routine histology. Chromosomal abnormality analysis and molecular probes are usually used to help pathologists.
Several genes from Table 6 corroborate each other according to existing research results. Figure 5 gives the protein-protein interaction (PPI) networks that were experimentally validated [38]. For classification, the listed genes that appear in the PPI networks can indicate the importance of the validated biological processes in the task. Generally speaking, several genes are involved in Wnt (wingless-related integration) signaling (e.g., TLE2 (transducin-like enhancer of split 2) and TCF7L2 (transcription factor 7-like 2)), cytoskeleton regulation, and cell migration and adhesion.

The full names of the genes in Figure 5 are as follows: in (a), dihydropyrimidine dehydrogenase (DPYD); carbamoyl-phosphate synthase 1 (CPS1); carbamoyl-phosphate synthetase 2, aspartate transcarbamylase, and dihydroorotase (CAD); uridine monophosphate synthase (UMPS); collapsin response mediator protein 1 (CRMP1); dihydropyrimidinase-like 3 (DPYSL3); dihydropyrimidinase-like 2 (DPYSL2); semaphorin 3A (SEMA3A); and cyclin-dependent kinase 5 (CDK5). FYN is a 59-kDa member of the Src family of kinases, typically associated with T-cell and neuronal signaling in development and normal cell physiology.
Some reported genes (including LGALS3BP, DPYSL2, EPHB4, DAPK1, EFNA1, MAPK1, FGFR1 (fibroblast growth factor receptor 1), CRMP1, and SPARC) belong to cell adhesion and migration regulators. Moreover, among these genes, some are also related to cytoskeleton regulation (DPYSL2, MAPK1 and CRMP1). However, the roles of these genes in tumor-subtype classifiers are not obvious and still need to be experimentally validated.
In addition to the top 30 genes, many of the remaining reported genes have been experimentally proven to be associated with the tumorigenesis process [39]. For example, LTA (lymphotoxin alpha) has been proven to be highly related to non-Hodgkin lymphoma; for ANXA1 (annexin A1), loss of function or expression of this gene has been detected in multiple tumors; and the gene product of SPARC has been correlated with metastasis based on changes to cell shape that can promote tumor-cell invasion. It is possible that those genes were altered in tumors but play weaker roles in classifying the different tumor subtypes of SRBCT.

Design and Analytical Flowchart
The flowchart of our proposed method (forward best-first search with modified chi-square test-based feature selection, FB-MCFS) is shown in Figure 6. It mainly includes three steps: modified chi-square test-based feature selection (MCFS), missing-value imputation, and the forward best-first search procedure. In MCFS, a modified chi-square test procedure is introduced to evaluate the importance degree (p-value) of each gene of the original incomplete expression dataset. Moreover, to meet the small-sample-size challenge of cancer gene-expression data, a heuristic recursive element aggregation process was proposed to make the chi-square approximation more accurate (activated only when its conditions are met). Genes were then selected with a given parameter to construct an incomplete data subset. After that, missing-value imputation was conducted on the selected incomplete data subset to generate a complete data subset. Finally, a forward best-first search strategy was utilized to identify the best top percentage of genes on the selected (complete) feature subset with the extreme learning machine as the base classifier.
Figure 6. Flowchart of the proposed method. The dotted lines divide the framework into three main steps.

Extreme Learning Machine (ELM)
ELM is a technique for single-hidden-layer feedforward networks (SLFNs) that has emerged in the past decades; it features much faster training speed and better generalization performance than traditional learning techniques. It is a special type of neural network: ELM randomly selects the weights and biases of the hidden nodes and then analytically determines the output weights by the least-square method, without time-consuming learning iterations.
For an arbitrary training set consisting of N samples (x i , y i ) with x i ∈ R d1 and y i ∈ R d2 , the output of an SLFN with M hidden neurons is:

y i = ∑ j=1..M β j g(ω j · x i + b j ), i = 1, . . . , N, (1)

where g(·) is the hidden activation function, and β j ∈ R d2 , ω j ∈ R d1 , and b j ∈ R are the learning parameters of the jth hidden node, respectively. For all N samples, a compact form of System (1) can be written as:

Hβ = Y, (2)

where H is the hidden-layer output matrix with H ij = g(ω j , b j , x i ), β = (β 1 , β 2 , . . . , β M ), and Y = (y 1 , y 2 , . . . , y N ). Let T = (t 1 , t 2 , . . . , t N ) be the target output matrix. To minimize the network cost function ||Y − T||, ELM claims that, with randomly initialized input weights and biases for the SLFN, System (1) becomes a linear model, and the output weights can be analytically determined by finding a least-square solution of linear System (1) as:

β = H † T, (3)

where H † is the Moore-Penrose generalized inverse [48] of the hidden-layer output matrix H. More details about the theoretical proofs of ELM are in the original paper [49]. The universal approximation property of ELM is also presented in Reference [50] to support the algorithm. Because the weights and biases remain unchanged during the training phase, some parameters may be non-optimal, and some samples near the classification boundary may be misclassified by ELM. To reduce the number of such misclassified samples, a voting-based ELM (V-ELM) was proposed by Cao [51]. In this work, we applied V-ELM as the base learning algorithm.
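The one-shot training step in Equations (1)-(3) can be sketched in numpy (a minimal illustration under a sigmoid activation, not the authors' implementation; V-ELM simply repeats this with several random initializations and takes a majority vote):

```python
import numpy as np

def elm_train(X, T, M, seed=0):
    """Train a basic ELM: random fixed hidden parameters, then the
    least-square output weights beta = pinv(H) @ T (Equation (3))."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], M))   # input weights, one column per hidden node
    b = rng.standard_normal(M)                 # hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # sigmoid hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T               # Moore-Penrose least-square solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

With more hidden nodes than training samples, the random hidden layer almost surely has full row rank, so the least-square solution interpolates a small training set exactly; this is why ELM trains in a single linear-algebra step.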

MCFS for Incomplete Data
Given an incomplete dataset D, we use '?' to denote the MVs. Let A be a feature of D that has m values (except for '?'), and let d be the class variable with l values (except for '?'). For each pair (a i , d j ), a i is a value of feature A and d j is a value of d. The occurrence count of (a i , d j ) (denoted by f ij ) is increased by fractions of the numbers of occurrences of (?, d j ), (a i , ?), and (?, ?) for the pair (A, d) [29].
For each pair (i, j), the expected frequency is:

E ij = (r i × c j )/n, (5)

where r i , c j , and n are the summation of the ith row, the summation of the jth column, and the total count of the frequency table, respectively. The chi-square statistic value of (A, d) can then be obtained by:

χ 2 (A,d) = ∑ i=1..m ∑ j=1..l (f ij − E ij ) 2 /E ij . (6)

The p-value can be computed from χ 2 (A,d) and the freedom degree (m − 1) × (l − 1) [52]. A larger p-value means a smaller relevance degree of A with respect to d. In the FS scenario, a given significance-level value α is applied to select features with p-values smaller than it, and the features can also be sorted by p-value (ascending order of p-value means descending order of relevance degree).
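Equations (5) and (6) can be sketched as follows (our own helper, not the authors' code; the counts in F may be fractional after the missing-value allocation of Reference [29]):

```python
import numpy as np

def chi_square_parts(F):
    """Chi-square statistic and freedom degree of a frequency table F
    (rows: values of feature A, columns: values of class variable d)."""
    F = np.asarray(F, dtype=float)
    r = F.sum(axis=1, keepdims=True)           # row summations r_i
    c = F.sum(axis=0, keepdims=True)           # column summations c_j
    E = r @ c / F.sum()                        # E_ij = r_i * c_j / n   (Eq. 5)
    stat = ((F - E) ** 2 / E).sum()            # chi-square statistic   (Eq. 6)
    dof = (F.shape[0] - 1) * (F.shape[1] - 1)  # (m - 1) x (l - 1)
    return stat, dof
```

The p-value is then the upper tail of the chi-square distribution with `dof` degrees of freedom (e.g., `scipy.stats.chi2.sf(stat, dof)`); a larger statistic yields a smaller p-value and hence a higher ranking.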

Example 1.
We present an example table with missing values (Table 9) to illustrate the chi-square-based feature-evaluation algorithm for incomplete data. u 1 , u 2 , u 3 , u 4 , u 5 , and u 6 are records; a 1 , a 2 , and a 3 are features; and d is the class variable. The procedure is as follows: (1) construct the frequency table as Table 10; (2) calculate the summations of the frequency table: the summations of the rows, the summations of the columns c 1 = 4 and c 2 = 2, and the total count ∑ j=1..2 f ij = 6 (f ij denotes the element of the i th row and the j th column); (3) update each element f ij (i ∈ {1, 2, 3}, j ∈ {1, 2}) to obtain Table 11.
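The frequency-table construction in the steps above can be sketched as follows. The exact fractions follow Reference [29]; here we assume, as one plausible reading, that each record containing a missing value is spread over the compatible cells in proportion to the fully observed counts:

```python
import numpy as np

def frequency_table(A, d, a_vals, d_vals, missing='?'):
    """Build the (A, d) frequency table for incomplete data, distributing
    records that contain missing values as fractional counts."""
    base = np.zeros((len(a_vals), len(d_vals)))
    for a, c in zip(A, d):                      # step 1: fully observed pairs
        if a != missing and c != missing:
            base[a_vals.index(a), d_vals.index(c)] += 1
    F = base.copy()
    rows, cols, n = base.sum(axis=1), base.sum(axis=0), base.sum()
    for a, c in zip(A, d):                      # steps 2-3: fractional updates
        if a == missing and c != missing:       # (?, d_j): spread over column j
            j = d_vals.index(c)
            if cols[j] > 0:
                F[:, j] += base[:, j] / cols[j]
        elif a != missing and c == missing:     # (a_i, ?): spread over row i
            i = a_vals.index(a)
            if rows[i] > 0:
                F[i, :] += base[i, :] / rows[i]
        elif a == missing and c == missing:     # (?, ?): spread over the table
            if n > 0:
                F += base / n
    return F
```

Under this scheme, each record still contributes a total count of one, so the table total stays equal to the number of records.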

Recursive-Element Aggregation
Gene-expression data often have a small number of samples, which presents a challenge to the chi-square test. The chi-square approximation requires at least 80% of the expected frequencies to be bigger than 5, and the approximation breaks down if the expected frequencies are too low [25]. To deal with this characteristic of gene-expression data, we propose a recursive element aggregation algorithm to make the approximation more accurate.
For a contingency table M as in Table 2, let f ij be the element of the i th row and j th column of M. Let M E be the expected frequency table composed of the expected frequencies corresponding to the elements of M, and suppose E ij is the smallest expected frequency in M E , corresponding to element f ij .
Our purpose is to test the dependence between a gene and the class variable. For a contingency table that does not satisfy the condition of the chi-square test, we merge a row with its adjacent row to aggregate elements. Algorithm 1 gives the process of recursive element aggregation, and Figure 7 gives an example describing it. In Figure 7a, expected frequency table ME 1 can be obtained from M 1 according to Equation (5). After that, the ratio (R f ) of expected frequencies bigger than 5 in ME 1 is calculated. Here, R f < 0.8, so element aggregation was activated. The smallest element (marked in red background) in ME 1 was first found, and then the corresponding row in contingency table M 1 (marked in orange background) was merged with its adjacent row (marked in yellow background), which has more elements smaller than 5. By doing this, we obtain contingency table M 2 . The above processes were iteratively conducted until R f > 0.8 or the number of rows of the contingency table equaled two. In Figure 7c, R f > 0.8, so the element-aggregation process was terminated, and contingency table M 3 was applied to calculate the chi-square statistic value according to Equation (6). Element aggregation is a heuristic process; after element aggregation, an optimized contingency table can be used to calculate the p-value based on Equation (6). We used MCFS to get the p-value of feature A with respect to d, and then the features were sorted in ascending order of p-value (descending order of relevance degree).
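Algorithm 1 can be sketched roughly as follows (a simplified reading of the merge rule, under our assumptions: the row holding the smallest expected frequency is merged with the adjacent row that has more expected frequencies below 5):

```python
import numpy as np

def recursive_element_aggregation(M, min_exp=5.0, min_ratio=0.8):
    """Merge adjacent rows of contingency table M until at least `min_ratio`
    of the expected frequencies reach `min_exp`, or only two rows remain."""
    M = np.asarray(M, dtype=float)
    while M.shape[0] > 2:
        E = np.outer(M.sum(axis=1), M.sum(axis=0)) / M.sum()   # Equation (5)
        if (E >= min_exp).mean() >= min_ratio:   # R_f >= 0.8: stop merging
            break
        i = np.unravel_index(np.argmin(E), E.shape)[0]   # row of smallest E_ij
        neighbors = [r for r in (i - 1, i + 1) if 0 <= r < M.shape[0]]
        j = max(neighbors, key=lambda r: int((E[r] < min_exp).sum()))
        lo, hi = sorted((i, j))
        M[lo] += M[hi]                           # merge the two rows
        M = np.delete(M, hi, axis=0)
    return M
```

The aggregated table preserves the total count, so the chi-square statistic of Equation (6) can be computed on it directly.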
In this study, a threshold was given to determine whether a feature was removed or selected.

Forward Best First Search on the MCFS Feature Subset
In this work, incomplete genes with a small relevance degree are removed by MCFS. The selected features may still include redundant genes, so we applied best-first search (BFS) to find the best top R% of features on the selected feature subsets. BFS is a standard search technique [53] that can be divided into two categories: forward BFS (FBFS) and backward BFS (BBFS). In recent years, the BFS technique has been successfully applied in bioinformatics [54,55]. FBFS is more efficient than BBFS in general; thus, we designed an FBFS strategy on the MCFS feature subset in this subsection.
Genes were ranked by p-value from smallest to largest (relevance degree from largest to smallest). FBFS begins with an empty set, and genes in the MCFS feature subset were added from the top R% to 100% in increments of R%. In each round, we first imputed the MVs, then trained V-ELM [51] on the data subset and evaluated its performance. If performance did not improve, the following k rounds were still evaluated; if the current best performance could not be improved within those rounds, FBFS was terminated. For a given ratio of features (pof), a group of feature subsets {S1, S2, ..., SK} (K = 100/pof) could be obtained. Figure 8 gives the flowchart of the FBFS method on the MCFS feature subset.
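The early-stopping loop above can be sketched as follows. This is an illustrative skeleton only: `evaluate(r)` stands in for the impute-then-V-ELM accuracy measurement on the top-r% genes, and the parameter names `pof` and `patience` (the k rounds of tolerance) are ours.

```python
def forward_best_first_search(evaluate, pof=10, patience=3):
    """FBFS sketch: grow the feature subset from the top-pof% ranked
    genes to 100%, stopping after `patience` rounds without improvement."""
    best_acc, best_r = -1.0, None
    stall = 0
    for r in range(pof, 101, pof):        # top-pof%, top-2*pof%, ..., 100%
        acc = evaluate(r)                 # impute MVs, train V-ELM, score
        if acc > best_acc:
            best_acc, best_r = acc, r
            stall = 0
        else:
            stall += 1
            if stall >= patience:         # k rounds with no improvement
                break
    return best_r, best_acc
```

Evaluating a few rounds past the current best guards against terminating on a local dip in accuracy, at the cost of a bounded number of extra training runs.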

Comparison Settings
Tenfold cross-validation was applied in this section. V-ELM was applied as the base learning algorithm with a sigmoid activation function. We set the independent training number of ELM to 30 based on a previous study [51]. Thirty experimental trials were conducted for each hidden node, and the average results are reported. As an FS method, the performance of the MCFS was first compared with several FS algorithms (neighborhood component analysis feature selection: NCA [56,57]; unsupervised feature filtering: UFF [36]; and ReliefF [58]) to validate its feasibility for FS. To enrich the comparison, PCA [59], a feature-extraction method, was also applied in this paper. NCA is a filter-based algorithm, and features with scores larger than 0.000001 were selected in this work. For PCA, the top-k largest eigenvalues whose cumulative sum ratio exceeded 0.999 were used to transform the original data into a new feature space. UFF is an unsupervised FS algorithm; for a feature i, UFF scores it using a leave-one-out calculation of singular value decomposition (SVD) entropy. These comparisons illustrate the possible impact of different beforehand MV-imputation methods on the FS algorithms. For ReliefF, we set k = 4, and features with weights bigger than 0.05 were selected in this work. For NCA, PCA, UFF, and ReliefF, an imputation method was first applied to construct a complete matrix for the subsequent FS. For our method, MV imputation was conducted only on the feature subsets selected by MCFS. Because MCFS removes only the most irrelevant genes, the features it selects might still be redundant with one another; this work therefore proposes a forward best-first search-based framework for the MCFS feature subsets.
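The score-threshold rules above reduce to simple cutoffs on precomputed per-feature scores. The following sketch (function names and inputs are ours; the 0.000001, 0.05, and 0.999 cutoffs are those stated in the text) shows how the selections could be applied once each method's scores or eigenvalues have been obtained.

```python
import numpy as np

def select_by_threshold(scores, threshold):
    """Indices of features whose score exceeds the method-specific cutoff
    (e.g. 0.000001 for NCA weights, 0.05 for ReliefF weights)."""
    return np.flatnonzero(np.asarray(scores) > threshold)

def pca_components_to_keep(eigenvalues, cum_ratio=0.999):
    """Number of top components whose cumulative eigenvalue ratio
    first reaches `cum_ratio` (the 0.999 cutoff used for PCA)."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cum = np.cumsum(ev) / ev.sum()
    return int(np.searchsorted(cum, cum_ratio) + 1)
```

Note that PCA transforms the data into a new space rather than selecting original genes, which is why it is treated as feature extraction rather than FS in the comparison.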

Discussion
Microarray technology enables researchers to investigate and address issues that were once thought to be intractable. Gene-expression data are important data obtained from microarray experiments. However, gene-expression data often suffer from missing-data and high-dimensionality problems. Thus, MV recovery and FS have been two basic research topics for gene-expression data in bioinformatics over the past two decades. Recent studies show that, for real data, MV imputation has a minor impact on downstream classification tasks, although imputation typically relies on the MAR assumption; moreover, the impact of MV imputation on subsequent FS is seldom considered.
We investigated the impact of different MV-imputation methods on subsequent FS in terms of classification. Three imputation methods and several FS algorithms were evaluated in this study. We found that, for classification tasks, MV imputation has a considerable influence on subsequent FS in terms of classification performance. By directly conducting FS on incomplete data and then filling the missing data in the selected subset, our approach avoids the potential influence of beforehand MV imputation on subsequent FS. With a proper threshold, our FS algorithm can remove most irrelevant genes, which makes downstream analysis (classification) more efficient.
As a filtering FS algorithm, the subset selected by our approach may not be optimal. We believe that the true relevance degree of some genes cannot be accurately measured because of missing data in the original datasets. Thus, we suggest selecting a few more genes by setting a relatively larger threshold (smaller relevance degree) to avoid discarding potentially important genes. After that, we utilized a wrapper methodology with a forward best-first search strategy to identify the most informative genes in the selected data subset. It should be emphasized that our criterion is applicable to both complete and incomplete datasets.
Our algorithm uses a heuristic recursive-element-aggregation procedure to increase the accuracy of the chi-square approximation. When the gene-expression sample size is very small, the element-aggregation procedure is likely to terminate prematurely (with the contingency table reduced to two rows). In this case, our method cannot produce a satisfactory p-value estimate.
Biological inference on SRBCT was also conducted to study whether our criterion is biologically meaningful. Many genes selected by our method agreed with those found by an artificial neural network (ANN)-based technique and by the UFF criterion. Moreover, some of the genes appear in several validated PPI networks, which indicates the importance of the identified biological processes in classification tasks. Our criterion also suggests potential new genes. However, the roles of these genes in tumor subtypes are not yet clear and still need to be experimentally validated.