1. Introduction
Recently, bioinformatics technologies have provided efficient ways to diagnose diseases, and machine learning methods applied in bioinformatics have achieved remarkable breakthroughs in the field of disease diagnosis [1]. Disease classification based on gene expression levels can efficiently distinguish disease-causing genes, so it has become an effective method in disease diagnosis and in assessing gene expression levels under different conditions [2,3,4]. The combination of data preprocessing and machine learning is an essential approach that improves the performance of many computer-aided diagnosis applications [5,6], including log-count normalization of original data for linear modelling [7]. Many state-of-the-art biological methods have been developed for disease classification. For example, a multiple feature evaluation approach (MFEA) based on a multi-agent system has been proposed to improve the diagnosis of Parkinson's disease [8]. A high-quality sampling approach has been proposed for the pre-diagnosis of imbalanced cancer samples [9]. Supervised discriminative sparse principal component analysis (SDSPCA) has been used to study the pathogenesis of diseases and to select genes [10].
However, disease classification using gene expression data also faces challenges because of its characteristic high dimensionality and small sample sizes [11]. Datasets generally contain large quantities of unlabelled samples, because whole-genome gene expression profiling is still too expensive for typical academic labs to generate a compendium of gene expression across a large number of conditions [12]. To improve classification performance, semi-supervised learning, an incremental learning technique, has been designed to exploit unlabelled samples and thereby obtain more labelled data. Semi-supervised learning has seen many successful applications, for example, a semi-supervised functional module detection method based on non-negative matrix factorization [13] and semi-supervised hidden Markov models for biological sequence analysis [14]. Moreover, self-training is a special semi-supervised learning method that learns from high- to low-confidence samples [15]. For example, self-training subspace clustering with low-rank representation has been proposed for cancer classification based on gene expression data [16], and a self-training algorithm previously assumed feasible only for prokaryotic genomes has been developed for gene identification [17]. Moreover, common classifiers do not achieve satisfactory accuracy because the number of samples is much smaller than the number of genes in gene expression data. To tackle this problem, a classifier named the forest deep neural network (FDNN) has been developed to integrate a deep neural network architecture with a supervised forest feature detector on RNA-seq expression datasets [18]. In addition, cancer subtype classification with deep learning can be used for single-sample prediction to facilitate the clinical implementation of cancer molecular subtyping [19]. The deep forest (DF) model, a decision tree ensemble approach with a non-neural-network-style deep model, is used in this work because it has been shown to achieve good performance in many tasks [20]. Furthermore, the deep forest exploits two types of forests, i.e., random forests (RFs) and completely random tree forests, which helps enhance diversity. Motivated by the lack of relevant research, we attempt to exploit the deep forest method for semi-supervised learning in biological tasks.
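The layer-wise idea behind the deep forest can be sketched minimally: each cascade layer concatenates the class-probability outputs of a random forest and a completely random (extra-trees) forest onto the input features before passing them to the next layer. This is an illustrative simplification with arbitrary forest sizes, not the full gcForest algorithm:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# synthetic stand-in for a (samples x genes) expression matrix
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

def cascade_layer(X_in, y, seed=0):
    """One deep-forest-style layer: augment the features with class
    probabilities from a random forest and an extra-trees forest."""
    rf = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_in, y)
    et = ExtraTreesClassifier(n_estimators=50, random_state=seed).fit(X_in, y)
    return np.hstack([X_in, rf.predict_proba(X_in), et.predict_proba(X_in)])

X1 = cascade_layer(X, y)   # layer 1 output
X2 = cascade_layer(X1, y)  # layer 2 stacks on the augmented features
print(X.shape, X1.shape, X2.shape)  # each layer adds 2 * n_classes columns
```

In the real model the probability vectors are produced by cross-validation to limit over-fitting, and the number of layers is determined automatically from validation performance.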
Many regularization methods have been proposed to identify significant genes and thereby achieve high-performance disease diagnosis. Regularization methods have recently attracted increased attention in gene selection and have become a key technique for preventing over-fitting [21]. For example, a popular regularization term, the L1 penalty, i.e., the Least Absolute Shrinkage and Selection Operator (LASSO), can shrink redundant coefficients to zero for gene selection and has been applied to high-dimensional data [22,23]. Research on disease-causing gene selection involving extensions of LASSO includes the identification of context-specific gene regulatory networks with gene expression modelling using LASSO [24] and the inference of gene expression networks with a weighted LASSO [25]. Stable feature selection can avoid negative influences when new training samples are added or removed [26]. Therefore, we investigate stable LASSO regularization to identify disease-causing genes in disease classification. In this paper, we propose a method combining the deep forest with semi-supervised self-training (DSST) to diagnose diseases. With the deep forest as the base model, semi-supervised learning via self-training provides more high-confidence labelled samples for deep forest training. Three types of disease datasets are applied to our proposed approach to assess its effectiveness and robustness.
The rest of this paper is structured as follows. Section 2 presents a literature review of the various studies applying machine learning to disease diagnosis, including deep forest and semi-supervised learning. Section 3 describes our method. Section 4 introduces the datasets. We discuss the results and performance of our approach in Section 5. Finally, conclusions are presented in Section 6.
2. Literature Review
Machine learning methods for disease diagnosis can be traced back to the 1990s [27]. Since then, various machine learning methods have been investigated and tested for cancer classification. A forward fuzzy cooperative coevolution technique proposed for breast cancer diagnosis has achieved the best accuracy [28]. A weighted naive Bayesian (NB) method has been presented to predict breast cancer status with a high F1 score and accuracy [29]. Recently, deep learning has achieved great success in various fields, including disease diagnosis. A new neighbouring ensemble predictor (NEP) method coupled with deep learning has been proposed to accurately predict a detected nuclear class label before quantitatively analysing the tissue constituents in whole-slide images to better understand cancer [30]. The application of deep learning methods to medical images can potentially improve diagnostic accuracy, with algorithms achieving areas under the curve (AUCs) of 0.994 [31]. However, the ideal parameters of deep neural networks are difficult to determine. The deep forest model implements a novel classifier based on decision tree ensembles that explores how to construct deep models from non-differentiable modules; such models offer guidance for improving the underlying theory of deep learning, and the deep forest exhibits these characteristics [32]. Moreover, the deep forest has fewer hyper-parameters than deep neural networks, and its model complexity can be determined automatically in a data-dependent manner. Various experimental results show that its performance is robust once the hyper-parameters are set; it can achieve excellent performance with the default settings, even on data from distinct domains. Many deep forest methods have been developed [33,34], and these methods have been successfully used in image retrieval [35] and cancer subtype classification [36].
Semi-supervised learning, an active research topic in machine learning in recent years, aims to label unlabelled data to improve the performance of a model. Many recent successful examples of semi-supervised learning in bioinformatics have been presented. For example, a semi-supervised network has been proposed to solve the high-dimensional problem of identifying known essential disease-causing genes [37]. Chai et al. proposed a semi-supervised learning method with the Cox proportional hazards and accelerated failure time (AFT) models to predict disease survival time, and the performance of the combined model exceeded that of the Cox or AFT model alone [38]. Moreover, self-training, a type of semi-supervised learning that learns by gradually including high- to low-confidence samples as pseudo-labelled samples, has been proposed [39]. Self-training has been successfully applied to computer vision [40], density peaks of data [41], computed tomography (CT) colonography [42] and other fields. In this paper, self-training with deep forests as base learners is used to learn from both labelled and unlabelled instances; in particular, experiments show that an ensemble learner provides an additional improvement over the performance of adapted learners [43].
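The self-training loop described above (train on labelled data, pseudo-label the most confident unlabelled samples, retrain) can be sketched as follows. The confidence threshold of 0.9, the number of rounds, and the random-forest base learner standing in for the deep forest are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=1)
labelled = np.zeros(len(y), dtype=bool)
labelled[:60] = True            # pretend only 60 samples carry labels
pseudo_y = y.copy()             # working label vector (pseudo-labels go here)

for _ in range(5):              # a few self-training rounds
    if labelled.all():
        break
    clf = RandomForestClassifier(n_estimators=100, random_state=1)
    clf.fit(X[labelled], pseudo_y[labelled])
    proba = clf.predict_proba(X[~labelled])
    confident = proba.max(axis=1) >= 0.9        # high-confidence predictions only
    idx = np.flatnonzero(~labelled)[confident]
    if idx.size == 0:
        break                   # nothing confident enough: stop early
    pseudo_y[idx] = clf.predict(X[~labelled])[confident]  # assign pseudo-labels
    labelled[idx] = True        # promote them to the training set

print(labelled.sum(), "samples labelled after self-training")
```

Samples are thus absorbed from high to low confidence, which is the behaviour the cited self-training methods rely on.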
From a biological point of view, most likely only a few genes strongly indicate the targeted diseases, and most genes are irrelevant to cancer classification. Irrelevant genes may introduce noise and reduce the classification accuracy. Given the importance of these problems, effective gene selection methods can help classify different types of cancer and improve the prediction accuracy [44]. Stability selection provides an approach to avoid many false positives in biomarker recognition by repeatedly subsampling the data and treating as biomarkers only those variables that are consistently important [45]. LASSO, a primary variable selection method, is a popular regularization approach that shrinks the regression coefficients towards zero if their corresponding variables are not related to the model prediction target [46]. To obtain sparser solutions, the Lp norm has been proposed, which simply consists of replacing the L1 norm with the non-convex Lp norm (0 < p < 1) [47]. A multi-stage convex relaxation scheme with a smoothed Lp regularization has been presented to solve problems with non-convex objective functions [48]. Zeng et al. [49] investigated the properties of the Lp (0 < p < 1) penalties and revealed the extreme importance and special role of the L1/2 regularization. Zou and Hastie [50] indicated that the Lp (0 < p < 1) penalty can provide a different sparsity evaluation and that the Lq (1 < q < 2) penalty can provide a grouping effect with different q values.
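Stability selection over LASSO, as described above, can be sketched by repeatedly subsampling the data, fitting an L1-penalized model, and scoring each gene by how often its coefficient is non-zero. The subsample fraction, number of repeats, penalty strength, and synthetic data below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# synthetic "expression" data: 40 samples, 100 genes, 5 truly informative
X, y = make_regression(n_samples=40, n_features=100, n_informative=5,
                       noise=0.1, random_state=0)

rng = np.random.default_rng(0)
n_rounds, frac = 50, 0.5
selected = np.zeros(X.shape[1])

for _ in range(n_rounds):
    # fit LASSO on a random half of the samples
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    coef = Lasso(alpha=1.0).fit(X[idx], y[idx]).coef_
    selected += coef != 0               # count non-zero coefficients per gene

stability = selected / n_rounds         # per-gene selection frequency in [0, 1]
top = np.argsort(stability)[::-1][:10]  # ten most stably selected genes
print(top, stability[top])
```

Genes whose stability score stays close to 1 across subsamples are the ones treated as robust biomarkers; sporadically selected genes are discarded as likely false positives.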
5. Results
Four common methods are used for comparison to assess the performance of our approach: deep neural networks (DNNs), logistic regression (LR), support vector machines (SVMs) and random forests (RFs). In the experiments, a portion of each of the three disease datasets is treated as unlabelled samples to assess the classification accuracy of the proposed method. The labelled and unlabelled samples are randomly selected in every run of the program. Table 2 provides more details about the distributions of the datasets used in the experiments. The tests use 10-fold cross-validation to evaluate the learning of the methods and track the variation in their performance.
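The 10-fold cross-validation protocol can be sketched with scikit-learn; the random-forest classifier and synthetic data below are placeholders for the actual models and datasets compared:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for one of the expression datasets
X, y = make_classification(n_samples=150, n_features=30, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 10-fold CV: each fold is held out once while the rest trains the model
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())  # mean accuracy and its fold-to-fold variation
```

Reporting the mean and standard deviation over the ten folds is what allows the variation in performance to be tracked across methods.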
The classification performance achieved by the various methods for the three datasets is shown in Table 3, which reports the results obtained on the test set. DSST produces the best results. For example, for the lung cancer dataset (GSE4115), DSST and the deep forest (DF) rank first and second, respectively: the accuracy of DSST is 0.7389, which is higher than the values of 0.6618 and 0.5926 achieved by the other methods. The receiver operating characteristic (ROC) curves obtained by the various methods in one run for the three datasets are shown in Figure 2, and the corresponding AUCs are shown in Table 3. DSST outperforms the other classifiers, including the deep forest model. Moreover, DSST is characterized by greater sparsity than DF and the other models. The F1 score of the DSST model is clearly the highest; i.e., its robustness is better than that of the remaining methods, which indicates that the mechanism used to update the pseudo-labelled samples is a crucial improvement for training the supervised learning model.
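The accuracy, F1 score, and AUC used above can all be computed from held-out predictions with scikit-learn; the classifier and synthetic split below are illustrative, not the paper's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)                # hard labels for accuracy and F1
y_score = clf.predict_proba(X_te)[:, 1]   # positive-class probability for AUC

acc = accuracy_score(y_te, y_pred)
f1 = f1_score(y_te, y_pred)
auc = roc_auc_score(y_te, y_score)        # area under the ROC curve
print(acc, f1, auc)
```

Note that the AUC is computed from the predicted probabilities rather than the thresholded labels, which is why it complements the accuracy and F1 comparisons.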
Discussion
To further illustrate the performance of our method in computer-aided diagnosis, stable LASSO is used in this work [45]. The top-10 ranked genes selected by stable LASSO in the various datasets are listed in Table 4, Table 5 and Table 6. Most stability scores are close to 1, which indicates that the selected genes are robust. Additionally, the p-values indicate that the results are significant. Many studies consider functional analysis of gene expression. For example, USP6NL in Table 4 acts as a GTPase-activating protein for RAB5A [51]. LMX1A in Table 5 acts as a tumour suppressor that inhibits cancer cell progression [52]. TP63 in Table 6 encodes a member of the p53 family of transcription factors, whose functional domains include an N-terminal transactivation domain, a central DNA-binding domain and an oligomerization domain [53].
Meanwhile, the heat map of the correlations between the genes is illustrated in Figure 3. Red indicates a positive correlation, while violet indicates a negative correlation; the darker the colour, the stronger the correlation. Figure 3 shows that most selected genes are positively correlated. The gene XBP1 in prostate cancer is negatively correlated with the other six genes.
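The gene-gene correlations visualized in such a heat map reduce to a Pearson correlation matrix over the expression columns; a minimal sketch with synthetic expression values (the anti-correlated fourth gene mimics the XBP1 pattern):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic expression matrix: 50 samples x 4 genes, gene 4 anti-correlated
base = rng.normal(size=50)
cols = [base + rng.normal(scale=0.3, size=50) for _ in range(3)]
cols.append(-base + rng.normal(scale=0.3, size=50))
expr = np.column_stack(cols)

corr = np.corrcoef(expr, rowvar=False)  # 4x4 gene-gene correlation matrix
print(np.round(corr, 2))                # positive off-diagonals except gene 4
```

Plotting `corr` with a diverging colour map (e.g., red for positive, violet for negative) then reproduces the kind of heat map shown in Figure 3.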
6. Conclusions
In this paper, we proposed a method combining the deep forest with semi-supervised self-training (called DSST) to solve the disease classification and gene selection problems for different types of diseases. The deep forest method is consistently superior to other conventional classification methods, possibly because it learns more significant high-level features during the learning process. Semi-supervised learning provides an effective alternative that alleviates over-fitting and improves the robustness of the model in the experiments. Improved experimental results can be obtained by combining semi-supervised learning and the deep forest model. By simultaneously considering all classes during the gene selection stage, our proposed extensions identify genes that lead to more accurate computer-aided diagnosis by doctors.
In the experiments, we used datasets for three types of diseases to assess and investigate the performance of our method, trained with 10-fold cross-validation on datasets of different sizes. The results show that our proposed disease classification approach achieves higher prediction accuracy than other methods published in the literature. However, the relevance threshold affects the classification performance differently when the number of training instances is small. Therefore, how to determine the relevance threshold adaptively will be a focus of our future work. Additionally, we believe that our mechanism can also be applied to other types of disease diagnosis problems and can be extended to various classifications of disease states.