Open Access Article

Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique

School of Computer Science and Technology, Anhui University, Hefei 230601, China
Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China
Authors to whom correspondence should be addressed.
Int. J. Mol. Sci. 2018, 19(11), 3398
Received: 11 October 2018 / Revised: 20 October 2018 / Accepted: 23 October 2018 / Published: 30 October 2018
(This article belongs to the Section Biochemistry)


Abstract: (1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods for estimating MVs have been proposed in the past few years. Recent studies show that the choice among those imputation algorithms makes little difference in classification. Some scholars therefore believe that selecting informative genes for downstream classification is more important than how MVs are imputed. However, most feature-selection (FS) algorithms require imputation beforehand, and the impact of that imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS is introduced for gene-expression data. To deal with the small sample size of gene-expression data, a heuristic method called recursive element aggregation is proposed in this study. Our approach handles incomplete data directly, without any imputation method or missing-data assumption. The most informative genes are selected through a threshold. After that, the best-first search strategy is used to find optimal feature subsets for classification. (3) Results: We compare our method with several FS algorithms. Evaluation is performed on twelve original incomplete cancer gene-expression datasets. We demonstrate that MV imputation on an incomplete dataset affects subsequent FS in terms of classification tasks. By conducting FS directly on incomplete data, our method avoids the potential disturbance that MV imputation causes in subsequent FS procedures. An experiment on the small, round blue cell tumor (SRBCT) dataset showed that our method found additional genes beyond the many genes it shares with the two existing methods compared.
Keywords: gene-expression data; feature selection; best first search; classification
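
The idea of filtering genes on incomplete data without imputing first can be illustrated with a minimal sketch. It is not the modified chi-square test or the recursive element aggregation described in the abstract, whose details are not given here. It is a plain chi-square filter under stated assumptions: each gene is binarized at its observed median, samples where that gene is missing are skipped for that gene only, and genes are kept when the statistic reaches a threshold. The function names `chi_square_observed` and `filter_genes`, the toy matrix, and the threshold value are all hypothetical.

```python
import math
from collections import Counter
from statistics import median


def chi_square_observed(values, labels):
    """Chi-square statistic of one gene vs. class labels, observed values only."""
    # Drop samples where this gene is missing (NaN); nothing is imputed.
    pairs = [(v, y) for v, y in zip(values, labels) if not math.isnan(v)]
    if len(pairs) < 2:
        return 0.0
    # Assumption: binarize expression at the median of the observed values.
    cut = median(v for v, _ in pairs)
    table = Counter((v > cut, y) for v, y in pairs)   # (bin, class) counts
    bin_totals = Counter(v > cut for v, _ in pairs)   # counts per bin
    class_totals = Counter(y for _, y in pairs)       # counts per class
    n = len(pairs)
    stat = 0.0
    for b, nb in bin_totals.items():
        for c, nc in class_totals.items():
            expected = nb * nc / n
            stat += (table[(b, c)] - expected) ** 2 / expected
    return stat


def filter_genes(matrix, labels, threshold):
    """Return (indices of genes meeting the threshold, all per-gene scores)."""
    n_genes = len(matrix[0])
    scores = [
        chi_square_observed([row[g] for row in matrix], labels)
        for g in range(n_genes)
    ]
    selected = [g for g, s in enumerate(scores) if s >= threshold]
    return selected, scores


# Toy data: rows are samples, columns are genes, NaN marks a missing value.
NaN = math.nan
X = [
    [5.1, 0.2, NaN],
    [4.8, NaN, 1.0],
    [NaN, 0.3, 1.1],
    [5.3, 0.1, 0.9],
    [1.0, 0.1, 1.2],
    [0.7, 0.3, 1.0],
    [0.9, 0.2, NaN],
    [NaN, 0.4, 1.1],
]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]

# 3.84 is roughly the 5% chi-square critical value for 1 degree of freedom;
# it is used here only as an illustrative threshold.
selected, scores = filter_genes(X, y, threshold=3.84)
print(selected)  # [0]
```

Because each gene is scored only on the samples where it was measured, the effective sample size differs from gene to gene. With very few observed samples the chi-square approximation is rough, which is the small-sample difficulty the abstract says recursive element aggregation is meant to address.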


This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

MDPI and ACS Style

Yan, Y.; Dai, T.; Yang, M.; Du, X.; Zhang, Y.; Zhang, Y. Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique. Int. J. Mol. Sci. 2018, 19, 3398.


Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Int. J. Mol. Sci. EISSN 1422-0067, published by MDPI AG, Basel, Switzerland