Developing Sustainable Classification of Diseases via Deep Learning and Semi-Supervised Learning

Disease classification based on machine learning has become a crucial research topic in genetics and molecular biology. Disease classification is generally framed as supervised learning; i.e., it requires a large number of labelled samples to achieve good classification performance. In the majority of cases, however, labelled samples are hard to obtain, so the amount of training data is limited. At the same time, many unclassified (unlabelled) sequences have been deposited in public databases, and these may aid the training procedure. Exploiting such data is called semi-supervised learning and is very useful in many applications. Self-training can be implemented by incorporating samples from high to low confidence, preventing noisy samples from undermining the robustness of semi-supervised learning during training. The deep forest method with the hyperparameter settings used in this paper can achieve excellent performance. Therefore, in this work, we propose a novel approach that combines a deep learning model with semi-supervised self-training to improve disease classification performance; it exploits unlabelled samples through an update mechanism designed to increase the number of high-confidence pseudo-labelled samples. The experimental results show that our proposed model achieves good performance in disease classification and disease-causing gene identification.


Introduction
Recently, bioinformatics technologies have provided efficient ways to diagnose diseases, and machine learning methods applied in bioinformatics have achieved remarkable breakthroughs in the field of disease diagnosis [1]. Disease classification based on gene expression levels can efficiently distinguish disease-causing genes, so it has become an effective method for disease diagnosis and for assessing gene expression levels under different conditions [2][3][4]. The combination of data preprocessing and machine learning is an essential approach that improves the performance of many computer-aided diagnosis applications [5,6], including log-count-normalized data in linear modelling [7]. Many state-of-the-art biological methods have been developed for disease classification. For example, a multiple feature evaluation approach (MFEA) of a multi-agent system has been proposed to improve the diagnosis of Parkinson's disease [8]. A high-quality sampling approach has been proposed for imbalanced cancer samples for pre-diagnosis [9]. Supervised discriminative sparse principal component analysis (SDSPCA) has been used to study the pathogenesis of diseases and gene selection [10]. A method achieving a high F1 score and accuracy has also been presented [29]. Recently, deep learning has achieved great success in various fields, such as disease diagnosis. A new neighbouring ensemble predictor (NEP) method coupled with deep learning has been proposed to accurately predict a detected nuclear class label before quantitatively analysing the tissue constituents in whole-slide images to better understand cancer [30]. The application of deep learning methods to medical images can potentially improve diagnostic accuracy, with algorithms achieving areas under the curve (AUCs) of 0.994 [31]. However, the ideal parameters of deep neural network methods are difficult to determine.
The deep forest model implements a novel classifier based on decision tree ensembles and explores how to construct deep models from non-differentiable modules. Such models offer guidance for improving the underlying theory of deep learning and yield a deep forest exhibiting these characteristics [32]. Moreover, the number of hyper-parameters is smaller than that of deep neural networks, and the complexity of the model can be determined automatically from the data. Various experimental results show that the model performance is robust once the hyper-parameters are set. Such models can achieve excellent performance with the default settings, even when data from distinct domains are considered. Many studies of deep forest methods have been developed [33,34], and these methods have been successfully applied to image retrieval [35] and cancer subtype classification [36].
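To make the layered structure concrete, the following is a minimal sketch of a cascade ("deep") forest. It is an illustrative simplification, not the authors' implementation: it uses two scikit-learn forests per layer, augments the input with each layer's out-of-fold class-probability vectors, and omits the multi-grained scanning stage of the original gcForest.

```python
# Minimal cascade-forest sketch: each layer's class-probability vectors
# are concatenated with the raw features and passed to the next layer.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict

class CascadeForest:
    def __init__(self, n_layers=3, n_estimators=100, random_state=0):
        self.n_layers = n_layers
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.layers = []

    def fit(self, X, y):
        feats = X
        for _ in range(self.n_layers):
            layer = [
                RandomForestClassifier(n_estimators=self.n_estimators,
                                       random_state=self.random_state),
                ExtraTreesClassifier(n_estimators=self.n_estimators,
                                     random_state=self.random_state),
            ]
            # Out-of-fold probabilities avoid leaking training labels
            # into the augmented features of the next layer.
            probas = [cross_val_predict(f, feats, y, cv=3,
                                        method="predict_proba")
                      for f in layer]
            for f in layer:
                f.fit(feats, y)
            self.layers.append(layer)
            feats = np.hstack([X] + probas)
        return self

    def predict(self, X):
        feats = X
        for layer in self.layers:
            probas = [f.predict_proba(feats) for f in layer]
            feats = np.hstack([X] + probas)
        # Average the final layer's class probabilities.
        return np.mean(probas, axis=0).argmax(axis=1)
```

Because each layer only adds fixed-size probability vectors, the model's depth can grow with the data without the hyper-parameter burden of a deep neural network.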
Semi-supervised learning, an active research topic in machine learning in recent years, aims to exploit unlabelled data to improve the performance of a model. Many recent successful applications of semi-supervised learning in bioinformatics have been presented. For example, a semi-supervised network has been proposed to solve the high-dimensional problem of identifying known essential disease-causing genes [37]. Chai et al. proposed a semi-supervised learning method with the Cox proportional hazards and accelerated failure time (AFT) models to predict disease survival time, and the performance of the model exceeded that of the Cox or AFT model alone [38]. Moreover, self-training, a type of semi-supervised learning that learns by gradually including high- to low-confidence samples as pseudo-labelled samples, has been proposed [39]. Self-training has been successfully applied to computer vision [40], data density peaks [41], computed tomography (CT) colonography [42] and other fields. In this paper, self-training with deep forests as base learners is used to learn from both labelled and unlabelled instances; in particular, the experiments show that an ensemble learner provides additional improvement over the performance of adapted learners [43].
From a biological point of view, most likely only a few genes can strongly indicate targeted diseases, and most genes are irrelevant to cancer classification. The irrelevant genes may introduce noise and reduce the classification accuracy. Given the importance of these problems, effective gene selection methods can help classify different types of cancer and improve the prediction accuracy [44]. Stability selection provides an approach to avoid many false positives in biomarker recognition by repeatedly subsampling the data and treating as biomarkers only those variables that are always important [45]. LASSO, a primary variable selection method, is a popular regularization method that shrinks the regression coefficients towards zero if their corresponding variables are not related to the model prediction target [46]. To obtain sparser solutions, the L_p norm has been proposed, which simply consists of replacing the L_1 norm with the non-convex L_p norm (0 < p < 1) [47]. A multi-stage convex relaxation scheme with a smoothed L_1 regularization has been presented to solve problems with non-convex objective functions [48]. Zeng et al. [49] investigated the properties of the L_p (0 < p < 1) penalties and revealed the extreme importance and special role of the L_{1/2} regularization. Zou and Hastie [50] indicated that the L_p (0 < p < 1) penalty can provide a different sparsity evaluation and that the L_q (1 < q < 2) penalty can provide a grouping effect with different q values.
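The sparsity effect of the L_1 penalty described above can be demonstrated with a short sketch. This uses L1-penalized logistic regression (the classification analogue of LASSO) on synthetic data standing in for gene expression profiles; the data, the regularization strength C=0.1 and the choice of five "true" genes are illustrative assumptions.

```python
# Sketch of LASSO-style gene selection: coefficients of genes unrelated
# to the label are shrunk exactly to zero, leaving a sparse gene set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_genes = 100, 500
X = rng.normal(size=(n_samples, n_genes))
# Only the first 5 "genes" actually drive the class label.
logits = X[:, :5] @ np.array([2.0, -1.5, 1.0, 2.5, -2.0])
y = (logits > 0).astype(int)

# L1-penalized logistic regression; smaller C means a stronger penalty
# and therefore a sparser set of selected genes.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)
selected = np.flatnonzero(lasso.coef_[0])
print("selected genes:", selected)
```

Most of the 500 coefficients end up exactly zero, while the informative genes tend to survive the shrinkage, which is the behaviour that makes LASSO attractive for high-dimensional expression data.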

Semi-Supervised Learning with Deep Forest
The deep forest approach provides an alternative to deep neural networks (DNNs) for learning hierarchical representations at low cost. Figure 1 illustrates the basic architecture of the deep forest model. The deep forest approach learns class distribution features directly from multiple decision trees instead of through the hidden layers of DNNs. Additionally, an ensemble of forests can classify distribution features more precisely, since the random forest has a potent classification ability. We use previously reported parameter settings [20] to iteratively process the data in the experiments. In our proposed method, the convergence condition is that the training samples (the original training samples combined with the obtained pseudo-labelled samples) achieve the best accuracy. In particular, labelled samples are used to train a base model that labels the unlabelled samples; the combined labelled and pseudo-labelled samples can then achieve higher performance in gene selection. The deep forest functions as the base model and is similar to the random forest ensemble model. In this paper, high-confidence samples are defined as those with smaller loss values; for example, in logistic regression, the closer the predicted y value is to 0 or 1, the smaller the loss, and such samples are treated as high-confidence samples.
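The iterative labelling loop described above can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: a random forest stands in for the deep forest base learner, and the 0.9 confidence threshold and iteration cap are assumed values.

```python
# Self-training sketch: a base learner labels the unlabelled pool, and
# only high-confidence predictions are added as pseudo-labelled samples.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    for _ in range(max_iter):
        model.fit(X_lab, y_lab)
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)
        conf = proba.max(axis=1)
        keep = conf >= threshold            # high-confidence samples only
        if not keep.any():
            break                           # nothing confident left to add
        X_lab = np.vstack([X_lab, pool[keep]])
        y_lab = np.concatenate([y_lab, proba[keep].argmax(axis=1)])
        pool = pool[~keep]                  # remove pseudo-labelled samples
    return model
```

Adding only high-confidence pseudo-labels at each round is what keeps noisy predictions from accumulating in the training set as the loop progresses.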

Self-Training
Consider a labelled training dataset D_l = {(X_i, y_i)}_{i=1}^m and an unlabelled dataset D_u = {X_i}_{i=m+1}^n, where X_i ∈ R^d is the i-th sample and ŷ_i (i = m+1, ..., n) is the pseudo-label assigned to X_i by a classification model. f(X_i, w) is a learned model with parameter w, and L(ŷ_i, f(X_i, w)) is the loss function of the i-th sample. The objective of self-training is to simultaneously optimize the model parameter w and the latent sample weights v = [v_{m+1}, v_{m+2}, ..., v_n] via the minimization in Equation (1).
where ŷ, g(λ, v_i) and λ are the pseudo-labels of the unlabelled data, the self-training regularizer and a penalty that controls the learning pace, respectively. In general, given sample weights v, the minimization over w is a weighted loss minimization problem, independent of the regularizer g(λ, v).
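The extracted text omits Equation (1) itself. A standard self-paced/self-training objective consistent with the definitions above (a reconstruction from the surrounding text, not a verbatim copy of the paper's equation) is:

```latex
\min_{w,\,v,\,\hat{y}} \;
\sum_{i=1}^{m} L\bigl(y_i, f(X_i, w)\bigr)
\;+\; \sum_{i=m+1}^{n} v_i \, L\bigl(\hat{y}_i, f(X_i, w)\bigr)
\;+\; g(\lambda, v),
\qquad v_i \in [0, 1]
\tag{1}
```

With v fixed, the first two terms form the weighted loss minimization over w mentioned above; with w fixed, minimizing over v and the penalty g(λ, v) determines which pseudo-labelled samples are admitted at the current pace λ.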

Datasets
In this study, three public cancer datasets from the National Center for Biotechnology Information, U.S. National Library of Medicine (https://www.ncbi.nlm.nih.gov/geo) are utilized. A brief description of these datasets is shown in Table 1.

Lung Dataset
The lung cancer dataset (GSE4115) (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc= GSE4115) is from Boston University Medical Center. The numbers of lung cancer and healthy samples are 97 and 90, respectively, and each sample contains 22,215 genes.

Prostate Dataset
The prostate cancer dataset (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1524991/) is from the MIT Whitehead Institute. After preprocessing, the prostate dataset contains 102 samples and 12,600 genes in two classes, tumour and normal, which account for 52 and 50 samples, respectively.

Results
Four common methods are used for comparison to assess the performance of our approach: deep neural networks (DNNs), logistic regression (LR), support vector machines (SVM) and random forests (RF). In the experiments, a portion of each of the three disease datasets is treated as unlabelled samples to assess the classification accuracy of the proposed method. The labelled and unlabelled samples are randomly selected in every run of the program. Table 2 provides more details about the distributions of the datasets used in the experiments. The tests use 10-fold cross-validation to evaluate the methods and track the variation in their performance. The classification performance achieved by the various methods on the three datasets is shown in Table 3. Table 3 shows the results on the test set obtained by the five methods. DSST produces the best results. For example, for the lung cancer dataset (GSE4115), DSST and the deep forest (DF) rank first and second, respectively: the accuracy of DSST is 0.7389, which is higher than the values of 0.6618 and 0.5926 achieved by DF and RF, respectively. The receiver operating characteristic (ROC) curves obtained by the various methods in one run on the three datasets are shown in Figure 2, and the corresponding AUCs are shown in Table 3. DSST outperforms the other classifiers and the deep forest model. Moreover, DSST is characterized by greater sparsity than DF and the other models. Clearly, the F1 score of the DSST model is the highest; i.e., the robustness of the model is better than that of the remaining methods, which indicates that the mechanism used to update the pseudo-labelled samples is a crucial improvement for supervised learning model training.
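The 10-fold cross-validation protocol with accuracy, AUC and F1 as metrics can be sketched as below. Synthetic data and a random forest stand in for the expression datasets and the compared classifiers; the scorer names follow scikit-learn conventions.

```python
# Sketch of the evaluation protocol: 10-fold CV reporting accuracy,
# AUC and F1 for one classifier, as used to compare methods in Table 3.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_validate(clf, X, y, cv=10,
                        scoring=["accuracy", "roc_auc", "f1"])
for metric in ("accuracy", "roc_auc", "f1"):
    print(metric, round(scores[f"test_{metric}"].mean(), 3))
```

Running the same loop for each method under the same fold assignments yields directly comparable per-metric means and variances.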

Discussion
To further illustrate the performance of our method in computer-aided diagnosis, stable LASSO is used in this work [45]. The top-10 ranked genes selected by stable LASSO in the various datasets are listed in Tables 4-6. Most stability scores are close to 1, which indicates that the selected genes are robust. Additionally, the p-values indicate that the results are significant. Many studies consider function analysis for gene expression. For example, USP6NL in Table 4 acts as a GTPase-activating protein for RAB5A [51]. LMX1A in Table 5 acts as a tumour suppressor to inhibit cancer cell progression [52]. TP63 in Table 6 is also among the selected genes. Meanwhile, the heat map of the correlations between the selected genes is illustrated in Figure 3. A red colour indicates a positive correlation, while a violet colour indicates a negative correlation. The darker the colour, the stronger the correlation. Figure 3 shows that most selected genes are positively correlated. In the prostate cancer dataset, the gene XBP1 is negatively correlated with the other six genes.
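The stability scores reported in Tables 4-6 can be computed along the lines of the sketch below. It follows the generic stability-selection recipe (repeated subsampling, counting non-zero LASSO coefficients) rather than the exact stable-LASSO implementation of [45]; the subsample count, fraction and penalty strength are illustrative choices.

```python
# Sketch of stability selection: LASSO is refit on repeated random
# subsamples, and each gene's stability score is the fraction of
# subsamples in which it receives a non-zero coefficient.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stability_scores(X, y, n_subsamples=50, frac=0.5, C=0.1, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_subsamples):
        idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
        lasso = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        lasso.fit(X[idx], y[idx])
        counts += lasso.coef_[0] != 0
    # Score near 1 means the gene is selected on almost every subsample.
    return counts / n_subsamples
```

Genes whose scores stay close to 1 across subsamples are the robust candidates, which matches the interpretation of the near-1 stability scores in Tables 4-6.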

Conclusions
In this paper, we proposed a method combining the deep forest and semi-supervised learning with self-training (called DSST) to solve disease classification and gene selection problems across different types of diseases. The deep forest method is consistently superior to other conventional classification methods, possibly because the deep forest approach learns more significant high-level features in the learning process. Semi-supervised learning provides an effective alternative that alleviates over-fitting and improves the robustness of the model in the experimental process. Improved experimental results can be obtained by combining semi-supervised learning and the deep forest model. By simultaneously considering all classes during the gene selection stages, our proposed extensions identify genes that lead to more accurate computer-aided diagnosis by doctors.
In the experiments, we used datasets for three types of diseases to assess and investigate the performance of our method using 10-fold cross-validation and datasets of different sizes. The results show that our proposed disease classification approach achieves higher prediction accuracy than other methods published in the literature. However, the appropriate relevance threshold varies with the classification performance when the number of training instances is small. Therefore, how to determine the relevance threshold adaptively will be a focus of our future work. Additionally, we believe that our mechanism can also be applied to other types of disease diagnosis problems and can be extended to various classifications of disease states.