
Symmetry 2020, 12(1), 154; https://doi.org/10.3390/sym12010154

Article
Classification of Kidney Cancer Data Using Cost-Sensitive Hybrid Deep Learning Approach
1
Medical Research Institute, Chungbuk National University, Cheongju 28644, Korea
2
College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea
3
Department of Nursing, Woosong College, Daejeon 34606, Korea
4
Department of Biomedical Engineering, School of Medicine, Chungbuk National University, Cheongju 28644, Korea
*
Author to whom correspondence should be addressed.
Received: 15 December 2019 / Accepted: 10 January 2020 / Published: 11 January 2020

Abstract

Recently, large-scale bioinformatics and genomic data have been generated using advanced biotechnology methods, thus increasing the importance of analyzing such data. Numerous data mining methods have been developed to process genomic data in the field of bioinformatics. We extracted significant genes for the prognosis prediction of 1157 patients using gene expression data from patients with kidney cancer. We then proposed an end-to-end, cost-sensitive hybrid deep learning (COST-HDL) approach with a cost-sensitive loss function for classification tasks on imbalanced kidney cancer data. To address the data imbalance problem, we combined a deep symmetric autoencoder, whose decoder is symmetric to the encoder in terms of layer structure and which uses a reconstruction loss for non-linear feature extraction, with a neural network that uses a balanced classification loss for prognosis prediction. Combined clinical data from patients with kidney cancer and gene data were used to determine the optimal classification model and estimate classification accuracy by sample type, primary diagnosis, tumor stage, and vital status as risk factors representing the state of patients. Experimental results showed that the COST-HDL approach was more efficient with gene expression data for kidney cancer prognosis than other conventional machine learning and data mining techniques. These results could be applied to extract features from gene biomarkers for the prognosis prediction, prevention, and early diagnosis of kidney cancer.
Keywords:
data mining; machine learning; kidney cancer; bioinformatics; autoencoder; neural network; cost-sensitive; hybrid deep learning; cancer classification

1. Introduction

Using bioinformatics approaches to identify genes that are useful for the diagnosis and prognosis prediction of patients with cancer can foster treatment. The analysis of cancer data is important yet difficult due to the large amounts of gene expression data available. Thus, only significant features that can express the health condition of patients must be extracted. Additionally, the development of efficient classification models based on the extracted genes is helpful for early diagnosis and prognosis prediction of patients with cancer. Cancer is caused by gene modifications, which may enable a cell to proliferate exponentially and then permeate normal surrounding cells before spreading through the body. In utilizing deep learning methods to accurately predict the disease condition of patients by analyzing mutations only in the gene sequence, studies have identified genes involved in spinal muscular atrophy, hereditary nonpolyposis colon cancer, and autism [1].
In this study, we extracted genes useful for the prognosis prediction of patients with kidney cancer and then predicted prognosis by applying a classification algorithm based on the gene. Kidney cancer is a primary tumor generated from the kidney, among which malignant renal cell carcinoma accounts for over 90% of cases. Because kidney cancer shows no symptoms at the early stages, it is often diagnosed at a progressive stage. According to registered statistics for cancer in Korea, 5043 kidney cancer cases were diagnosed in 2016, thereby ranking 10th among all cancers. In fact, the annual incidence of kidney cancer increased steadily from 1999 to 2019 [2]. Additionally, the symptoms and treatment of kidney cancer decrease the quality of life of the patients by increasing the disease burden and medical costs. Lifestyle factors, such as poor diet, physical inactivity, smoking, and alcohol consumption, are associated with an increased risk of kidney cancer. Additionally, genetic and environmental factors influence all of these risk factors and diseases, such as diabetes, hypertension, and obesity [3].
There have been various successful applications of machine learning and data mining techniques to bioinformatics and genomics [4] research. For example, PathAI was implemented for digital pathology after the analysis of image data from patients with breast cancer using artificial intelligence, which decreased the error rate of diagnosing metastasized cancer through deep learning [5]. Additionally, a study [6] at Emory University analyzed the survival rate of patients with brain tumors by combining gene data with pathology image data, which showed a very high accuracy of survival rate prediction. It was reported that deep learning convolutional neural networks achieved higher accuracy than pathologist-based diagnosis in the prediction of survival rate [6]. Another study predicted the degree of risk of approximately 20 cancers by applying machine learning and artificial intelligence to analyze gene-related big data [7]. Over the years, various technologies for data mining have been applied. Specifically, a deep learning method was applied to infer the expression of target genes from the expression of landmark genes [8]. The tested method significantly outperformed other machine learning algorithms. Recent studies were also conducted to develop a classification model system for diagnosing disease and cancer using machine learning [9,10].
Most studies have been conducted to extract features using genome data from patients with kidney cancer by data mining, statistical methods, and classification algorithms [11,12,13]. Various bioinformatics and genomic data have also been applied in algorithms based on machine learning [14,15,16]. Recently, due to the advantages of deep learning, various deep learning approaches have been applied to cancer research using gene expression data [17,18,19]. Deep learning approaches are useful for constructing predictive models and for feature extraction: they map the lowest input layer to the uppermost output layer, with higher levels representing more abstract entities, without using hand-crafted features or rules [20,21]. Using data from The Cancer Genome Atlas (TCGA) [22], we used a deep learning approach in a prior study to extract genes related to cancer by combining RNA sequencing and DNA methylation data. We evaluated breast invasive carcinoma, thyroid carcinoma, and kidney renal papillary cell carcinoma [23].
In this study, we combined gene expression and clinical data from patients with kidney cancer from TCGA and applied our proposed end-to-end deep learning COST-HDL approach. We compared the proposed approach with several traditional data mining and machine learning methods that are not implemented end-to-end. These methods have multiple steps such as feature engineering, over- and under-sampling, and classification. The objectives of this study are to extract deep features from gene biomarkers for precisely predicting prognosis, overcome differences in various types of cancer data, and develop an end-to-end prediction model by comparing and analyzing classification algorithms using the extracted genes. The major contributions of this paper can be summarized as follows: (1) We propose an end-to-end approach without any manual engineering, which predicts kidney cancer prognosis including sample type, primary diagnosis, tumor stage, and vital status. (2) We propose a non-linear transformation strategy, the deep symmetric autoencoder, to extract deep features from gene biomarkers in kidney cancer by taking advantage of the deep learning structure. (3) We propose a mixed loss function for the proposed deep learning model that considers both the compression of the knowledge representation and the data imbalance problem.
The remainder of the paper is organized as follows: Section 2 introduces the gene expression dataset from patients with kidney cancer and explains the proposed deep learning approach in detail. In Section 3, the experimental results are provided. Finally, Section 4 discusses the experimental analysis and presents our conclusions.

2. Materials and Methods

2.1. Dataset

TCGA contains a variety of gene information such as single-nucleotide polymorphism (SNP) and gene expression (mRNA expression) data from large numbers of patients with cancer, which are stored in a database [22]. We collected TCGA data from 1157 patients with kidney cancer together with clinical information including sample type, primary diagnosis, tumor stage, and vital status. Each clinical variable is used as a class label in the prognosis prediction task. The degree of gene expression was estimated at the RNA level, and the expression data (transcriptome profiling) were merged and digitized after assigning transaction IDs. We used 60,483 gene expression values from each patient with kidney cancer, expressed with the Fragments Per Kilobase per Million mapped (FPKM) measure [24]. The kidney cancer dataset was used to extract the complex structure of gene biomarkers and estimate classification accuracy as risk factors by sample type, primary diagnosis, tumor stage, and vital status representing the state of patients.
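For reference, the FPKM measure normalizes a raw fragment count by gene length and by sequencing depth. In its usual form it can be written as below, where the symbols are introduced here only for illustration: $C_g$ is the number of fragments mapped to gene $g$, $\ell_g$ is the length of the gene in base pairs, and $M$ is the total number of mapped fragments in the sample.

$$
\mathrm{FPKM}_g = \frac{10^{9} \times C_g}{M \times \ell_g}
$$

As a hypothetical worked example, a gene of 2000 base pairs with 500 mapped fragments in a sample of $2 \times 10^{7}$ mapped fragments has $\mathrm{FPKM} = 10^{9} \times 500 / (2 \times 10^{7} \times 2000) = 12.5$.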
The statistics of the dataset are shown in Table 1. In the preprocessing step, we removed all zero-variance gene expression features and noisy samples. The number of samples and the number of gene expression features therefore varied across the prognoses, and each dataset was split into 80% for training and 20% for testing. The datasets are highly imbalanced, especially the dataset for sample type prognosis, which contains 87.9% primary tumor samples and 12.1% solid tissue normal samples.
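The preprocessing described above can be sketched in a few lines of plain Python. This is a minimal illustration on a toy matrix, not the authors' pipeline: the function names, the random (non-stratified) split, and the choice to filter columns before splitting are all assumptions, since the text does not specify them.

```python
import random
from statistics import pvariance


def drop_zero_variance(rows):
    """Return (filtered_rows, kept_column_indices), dropping constant columns."""
    n_cols = len(rows[0])
    kept = [j for j in range(n_cols)
            if pvariance([row[j] for row in rows]) > 0.0]
    return [[row[j] for j in kept] for row in rows], kept


def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into training and testing parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)


# Toy expression matrix: 5 samples x 4 genes; gene 1 is constant across samples.
expression = [
    [0.0, 3.0, 1.2, 7.0],
    [2.5, 3.0, 0.0, 6.1],
    [1.0, 3.0, 4.4, 5.9],
    [0.3, 3.0, 2.2, 8.0],
    [4.1, 3.0, 0.9, 7.7],
]
sample_type = [1, 1, 0, 1, 1]  # hypothetical labels (1 = tumor, 0 = normal)

filtered, kept_genes = drop_zero_variance(expression)
(train_x, train_y), (test_x, test_y) = train_test_split(filtered, sample_type)
```

With a class ratio as skewed as 87.9% to 12.1%, a stratified split would keep the ratio similar in both parts; whether the study stratified is not stated, so the sketch uses a plain shuffle.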
In the analysis, we applied a cost-sensitive loss function to address this data imbalance problem and compared it with sampling methods. We also used the deep symmetric autoencoder (DAE) model to extract features from the high-dimensional gene expression data and compared it with other feature-selection and dimension-reduction techniques.

2.2. The COST-HDL Approach

In the experiments, the extracted target genes were subject to classification analysis, and the performance was evaluated. Figure 1 shows the proposed COST-HDL approach, which takes the gene expression data of kidney cancer from the TCGA portal as input and outputs four kinds of prognoses, namely sample type, primary diagnosis, tumor stage, and vital status. It consists of a hybrid of DAE and neural network (NN) models. For the RNA sequencing data, the number of variables is significantly higher than the number of samples. With more than 60,000 variables, it is technically challenging to apply data mining and machine learning algorithms directly to the raw dataset. Therefore, in this study, we used the 5-layer DAE model (the first 2 layers for encoding, the middle layer for gene extraction, and the last 2 layers for decoding) to extract deep features from gene biomarkers. The extracted deep features were input to the NN classification model (hidden layer + dropout [25] + Rectified Linear Unit (ReLU) [26] + softmax [27]).
The DAE model employed the mean squared error (MSE) as a reconstruction loss during the training, while the NN model used the focal loss [28] as a balanced classification loss. Focal loss is the reshaping of cross-entropy loss such that it down-weights the loss assigned to well-classified examples. The novel focal loss focuses on training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. The proposed COST-HDL approach uses the sum of the reconstruction loss and balanced classification loss as a cost function.
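For reference, the focal loss of [28] for a single example can be written as follows, where $p_t$ is the predicted probability of the true class, $\gamma \geq 0$ is the focusing parameter, and $\alpha_t$ is an optional class weight. The values of $\gamma$ and $\alpha_t$ used in this study are not stated in this section, so they are left symbolic here.

$$
\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)
$$

With $\gamma = 0$ and $\alpha_t = 1$ this reduces to the standard cross-entropy loss. As an illustration with $\gamma = 2$, a well-classified example with $p_t = 0.9$ has its loss scaled by $(1 - 0.9)^2 = 0.01$, whereas a hard example with $p_t = 0.1$ is scaled only by $(1 - 0.1)^2 = 0.81$.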
The experimental hardware platform was an Intel Xeon E3 machine (32 GB memory, GTX 1080 Ti). We used Ubuntu 18.04 as the computational environment, and Python 3.7 with the Scikit-Learn [29] and PyTorch [30] libraries was used for data collection and analysis. The following paragraphs describe the DAE model for extracting deep features from gene biomarkers and the NN model for constructing prognosis prediction models in detail.

2.2.1. Extracting Deep Features from Gene Biomarkers

We utilized the training dataset to extract features from the gene expression data by using the DAE non-linear feature transformation method, and we compared it with the Principal Component Analysis (PCA) [31] linear feature transformation and the Least Absolute Shrinkage and Selection Operator (LASSO) [32] feature selection methods. PCA explains correlated multivariate data with a smaller number of linearly uncorrelated variables, each of which is a linear combination of the original variables. Because of this linearity constraint, we developed a DAE with non-linear activation functions, which reconstructs the data more accurately. Feature selection methods such as LASSO, in contrast, select the best features or a subset of the original feature set and do not alter the original representation of the data [33]. Thus, they may lose some important information during the selection process when extracting the complex structure of cancer data.
We developed the DAE model using PyTorch to extract deep features from gene biomarkers. The architecture of the DAE model consists of encoder and decoder parts. The encoder part comprised one input layer and three fully connected encoding hidden layers with 1000, 500, and 100 nodes, respectively. The last of these hidden layers was chosen as the deep feature layer for extracting the gene biomarkers. The decoder part comprised two fully connected decoding hidden layers with 500 and 1000 nodes, respectively. The last decoding layer was chosen to be the output layer (reconstructed input). The decoding layers use the transposed weights of the encoding layers. The procedure can be formulated as below:
$$
\begin{aligned}
\text{hidden\_encode1} &= \mathrm{ReLU}(W_1 \times \text{input} + b_1) \\
\text{hidden\_encode2} &= \mathrm{ReLU}(W_2 \times \text{hidden\_encode1} + b_2) \\
\text{hidden\_encode3} &= W_3 \times \text{hidden\_encode2} + b_3 \\
\text{hidden\_decode1} &= \mathrm{ReLU}(W_2^{T} \times \text{hidden\_encode3} + b_2) \\
\text{reconstructed\_input} &= \mathrm{Tanh}(W_1^{T} \times \text{hidden\_encode1} + b_1)
\end{aligned}
$$
where $W_1$, $W_2$, and $W_3$ are the weight matrices between the layers with sizes of $N \times 1000$, $1000 \times 500$, and $500 \times 100$, respectively; $N$ is the size of the input (the number of gene expression features per sample); $b_1$, $b_2$, and $b_3$ are the biases for each node; and ReLU and Tanh are non-linear activation functions. The terms with superscripts refer to the transposed matrices. The hidden_encode3 layer was chosen to provide the activity values of the deep features in this model. The DAE uses the MSE as its loss function, which measures the reconstruction error between the original data and the reconstructed data.
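The forward pass of a symmetric autoencoder with transposed (tied) decoder weights can be sketched in plain Python as below. This is one plausible reading of the description above, not the authors' PyTorch code: it mirrors all three encoder weights in reverse order (100 to 500 to 1000 to input size), uses toy layer sizes in place of N, 1000, 500, and 100, stores each weight as an (output x input) matrix, and assumes separate zero-initialized decoder biases. All names are hypothetical.

```python
import math
import random


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def affine(matrix, vector, bias):
    """Compute matrix x vector + bias for a (rows x cols) matrix."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(matrix, bias)]


def relu(vector):
    return [max(0.0, v) for v in vector]


def tanh(vector):
    return [math.tanh(v) for v in vector]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy layer sizes standing in for N -> 1000 -> 500 -> 100 (hypothetical values).
sizes = [8, 6, 4, 2]
rng = random.Random(0)
W = [random_matrix(sizes[i + 1], sizes[i], rng) for i in range(3)]
b_enc = [[0.0] * sizes[i + 1] for i in range(3)]
# Assumption: the decoder has its own biases, one vector per mirrored layer.
b_dec = [[0.0] * sizes[i] for i in (2, 1, 0)]


def forward(x):
    h1 = relu(affine(W[0], x, b_enc[0]))
    h2 = relu(affine(W[1], h1, b_enc[1]))
    h3 = affine(W[2], h2, b_enc[2])  # linear bottleneck: the deep features
    d1 = relu(affine(transpose(W[2]), h3, b_dec[0]))
    d2 = relu(affine(transpose(W[1]), d1, b_dec[1]))
    x_hat = tanh(affine(transpose(W[0]), d2, b_dec[2]))
    return h3, x_hat


def mse(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


x = [rng.uniform(-1.0, 1.0) for _ in range(sizes[0])]
features, reconstruction = forward(x)
reconstruction_loss = mse(x, reconstruction)
```

Because the output activation is Tanh, the reconstruction is confined to the interval from -1 to 1, so this sketch assumes the input has been scaled to that range.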

2.2.2. Constructing Prognosis Prediction Models

For the prognosis prediction models, we constructed a feedforward neural network, which contained one input layer, one hidden layer with 100 nodes, and one output layer. The deep features of the hidden_encode3 layer in the DAE model were used as the input of the NN model. This procedure can be formulated as below:
$$
\begin{aligned}
\text{hidden\_layer} &= \mathrm{ReLU}(W_4 \times \text{hidden\_encode3} + b_4) \\
\text{output} &= \mathrm{softmax}(W_5 \times \text{hidden\_layer} + b_5)
\end{aligned}
$$
where $W_4$ and $W_5$ are the weight matrices between the layers with sizes of $100 \times 100$ and $100 \times C$, respectively; $C$ is the size of the output, i.e., the number of class types; $b_4$ and $b_5$ are the biases for each node; and ReLU and softmax are non-linear activation functions. The softmax activation function normalizes the outputs into class probabilities that sum to 1, which allows a probabilistic interpretation of the prediction. A dropout layer was added after the hidden_layer, which randomly set 20% of the outputs of that layer to 0. The NN uses a loss function that measures the classification error between the true class and the predicted class and also addresses the class imbalance. The NN model employed the focal loss as this loss function. The focal loss addresses the class imbalance problem by reshaping the standard cross-entropy loss such that it down-weights the loss assigned to well-classified examples.
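The classifier head described above (ReLU hidden layer, 20% dropout, softmax output) can be sketched as follows. This is an illustrative plain-Python version with toy sizes, not the authors' implementation. The rescaling of the surviving units by 1 / (1 - rate) follows the common "inverted dropout" convention and is an assumption, as are all function and variable names.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(vector, rate, rng, training=True):
    """Zero each unit with probability `rate` and rescale the remaining units."""
    if not training or rate == 0.0:
        return list(vector)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vector]


def affine(matrix, vector, bias):
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(matrix, bias)]


def classify(features, W4, b4, W5, b5, rng, training=False):
    hidden = [max(0.0, z) for z in affine(W4, features, b4)]  # ReLU
    hidden = dropout(hidden, rate=0.2, rng=rng, training=training)
    return softmax(affine(W5, hidden, b5))


rng = random.Random(0)
n_features, n_hidden, n_classes = 4, 5, 3  # toy stand-ins for 100, 100, C
W4 = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_hidden)]
W5 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_classes)]
b4, b5 = [0.0] * n_hidden, [0.0] * n_classes

probabilities = classify([0.3, -1.2, 0.8, 0.1], W4, b4, W5, b5, rng)
predicted_class = max(range(n_classes), key=probabilities.__getitem__)
```

Dropout is active only when `training=True`; at prediction time the layer passes its input through unchanged, which is why the call above is deterministic.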

2.2.3. Training the Models

The cost function L was used to measure the difference between the input and the output:
$$
\begin{aligned}
L_{DAE}(\text{input}, \text{reconstructed\_input}) &= \text{MSE loss} \\
L_{NN}(\text{hidden\_encode3}, \text{output}) &= \text{focal loss} \\
L(\text{input}, \text{output}) &= L_{DAE} + L_{NN}
\end{aligned}
$$
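A minimal sketch of this combined cost for a single example is given below. The unweighted sum follows the formulation above; the focusing parameter `gamma=2.0` is the default from the focal loss paper [28] and is an assumption here, since the value used in this study is not given. Function names and the toy numbers are hypothetical.

```python
import math


def mse_loss(x, x_hat):
    """Mean squared reconstruction error between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def focal_loss(probs, true_class, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)**gamma * log(p_t)."""
    p_t = probs[true_class]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def total_loss(x, x_hat, probs, true_class, gamma=2.0):
    """Sum of the reconstruction loss and the balanced classification loss."""
    return mse_loss(x, x_hat) + focal_loss(probs, true_class, gamma)


x = [0.2, -0.4, 0.9]        # toy input
x_hat = [0.1, -0.4, 0.7]    # toy reconstruction from the decoder
probs = [0.7, 0.2, 0.1]     # toy softmax output of the classifier
loss = total_loss(x, x_hat, probs, true_class=0)
```

Because both terms are computed from the same forward pass, minimizing their sum trains the encoder to produce features that are useful for reconstruction and for classification at the same time.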
For the optimization, we selected the Adam optimizer [34], which has several freely adjustable hyperparameters, as the strategy for updating the weights and biases so that a minimum of the cost function could be found. After running different trials, the learning rate was finally set to 0.00001, and the batch size and the number of epochs were set to 128 and 2000, respectively. The models were trained with these settings. We chose the checkpoint model that showed the lowest error on the training set. The activity values and weight matrices related to the deep features were then read out.
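To make the optimizer settings concrete, the sketch below applies the standard Adam update rule to a single scalar parameter with the learning rate quoted above. The values `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are the commonly used defaults and are assumptions, since the text only reports the learning rate, batch size, and epoch count. The quadratic objective is a toy stand-in for the network loss.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)               # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
first_theta, _, _ = adam_step(theta, 2.0 * (theta - 3.0), m, v, t=1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * (theta - 3.0), m, v, t)
```

In this toy run each update moves the parameter by roughly the learning rate, regardless of the gradient magnitude, which illustrates why a small learning rate such as 0.00001 is paired with a large number of updates.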

3. Results

3.1. Visualization of Feature Extraction

The training set was utilized to analyze and extract deep features from gene biomarkers by the DAE model. We compared it with the PCA dimension reduction and LASSO feature selection methods. We extracted 100 features for each classification task for further analysis by the DAE model as shown in Table 2. For a fair comparison, we also extracted 100 features for each classification task by the PCA method as shown in Table 3. Different numbers of gene biomarkers were selected by the LASSO method as shown in Table 4. The testing set was utilized to evaluate the feature extraction from gene biomarkers. We developed the PCA and LASSO methods using Scikit-Learn and developed the DAE model using Pytorch.
For the visualization of the deep features extracted by DAE, the features extracted by PCA, and the features selected by LASSO, we used t-Distributed Stochastic Neighbor Embedding (TSNE) [35]. TSNE is a widely used non-linear dimensionality reduction technique for visualizing high-dimensional data on a two- (or three-) dimensional plane.
We used the two-dimensional plane for the following visualizations of extracted features as shown in Figure 2, Figure 3, Figure 4 and Figure 5 for each prognosis.
The visualization of the extracted features from the gene biomarkers for the prognosis such as sample type, primary diagnosis, tumor stage, and vital status are shown in Figure 2, Figure 3, Figure 4 and Figure 5, respectively. It can be seen that the deep features extracted by the DAE model were distinguished better than the features extracted by the PCA method and the features selected by the LASSO method on both the training and testing sets. Further, other prognoses are identified by the DAE method.

3.2. Training Process

We trained our COST-HDL approach for 2000 epochs. Each loss (MSE, focal, and total) during training is shown in Figure 6, Figure 7, Figure 8 and Figure 9 for each prognosis. The MSE loss decreased continuously in all experiments for each prognosis. In the multi-class case, tumor stage prognosis, it decreased more steadily. The focal loss also decreased, but it fluctuated more during training for each prognosis. In the binary-class case, sample type prognosis, it fluctuated the most, between the values 0.6 and 1. This was because the model had already reached 100% performance.

3.3. Evaluation of Prognosis Prediction Models

To evaluate our COST-HDL approach, four indices, namely accuracy, precision, recall, and f1-score, were employed to measure the classification performance, and they are defined as follows.
$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
\text{F1-score} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
$$
where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives, respectively. A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class. A false positive is an outcome where the model incorrectly predicts the positive class, and a false negative is an outcome where the model incorrectly predicts the negative class. In Table 5, we compared the models with different loss functions (only MSE loss, only focal loss, and total loss). It can be seen that the models with total loss show better performance than the single-loss models, and the models with only MSE loss show the worst results.
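The four indices defined above can be computed directly from the confusion counts, as in the short sketch below. The counts are hypothetical and serve only to illustrate the arithmetic. For the multi-class tasks some averaging over classes is needed; how the study averaged is not stated, so the sketch covers the binary case only.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from binary confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a test set of 100 samples.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these counts the accuracy is 85/100, the precision 40/45, the recall 40/50, and the F1-score 80/95. On an imbalanced dataset, accuracy alone can look high even when the minority class is poorly predicted, which is why precision, recall, and F1-score are reported alongside it.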
For the prediction of sample type prognosis, our COST-HDL approach with total loss achieved the highest results: 100% accuracy, 100% precision, 100% recall, and 100% f1-score. It improved the model with only focal loss by 0.43% of accuracy, 0.24% of precision, 2% of recall, and 1.14% of f1-score.
For the prediction of primary diagnosis prognosis, our COST-HDL approach with total loss achieved the highest results: 96.98% accuracy, 97.43% precision, 95.68% recall, and 96.49% f1-score. It improved the model with only focal loss by 0.43% of accuracy, 0.3% of precision, 0.67% of recall, and 0.52% of f1-score.
For the prediction of tumor stage prognosis, our COST-HDL approach with total loss achieved the highest results: 56.70% accuracy, 49.41% precision, 46.14% recall, and 46.68% f1-score. It improved the model with only focal loss by 2.24% of accuracy, 4.26% of precision, 1.09% of recall, and 2.92% of f1-score.
For the prediction of vital status prognosis, our COST-HDL approach with total loss achieved the highest results: 76.72% accuracy, 69.78% precision, 68.92% recall, and 69.32% f1-score. It improved the model with only focal loss by 0.43% of accuracy, 0.78% of precision, 1.87% of recall, and 1.49% of f1-score.
We verified whether our COST-HDL approach performs better than traditional machine learning classifiers, such as K-Nearest Neighbors (KNN) [36], Linear Support Vector Machine (Linear SVM) [37], Kernel Support Vector Machine (Kernel SVM) [38], Random Forest (RF) [39], and Neural Network (NN) [40]. The traditional machine learning classifiers are preceded by feature extraction methods such as PCA dimension reduction and LASSO feature selection. To solve the data imbalance problem, they usually employ sampling methods such as the Synthetic Minority Over-sampling Technique (SMOTE) [41], which is an over-sampling method.
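The idea behind SMOTE can be illustrated with a simplified plain-Python sketch: each synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is a reduced illustration and not the reference algorithm of [41] or the implementation used in the experiments; the function name, the random choice of base samples, and the toy data are assumptions.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Minority neighbours of `base`, sorted by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class with four two-dimensional samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)
```

In a pipeline of the kind compared here, over-sampling of this sort would normally be applied to the training data only, so that the test set keeps its original class distribution.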
Hence, in this paper, we compared our COST-HDL approach with a total loss to the traditional combination of methods: feature extraction → sampling → classifier, as shown in Table 6, Table 7, Table 8 and Table 9 for each prognosis.
For the sample type prognosis, the RF classifier with LASSO feature selection and SMOTE sampling achieved 100% accuracy, 100% precision, 100% recall, and 100% f1-score. The second-best results were 99.57% accuracy, 98.08% precision, 99.76% recall, and 98.90% f1-score achieved by the KNN and NN with LASSO feature selection and SMOTE sampling. The worst results were achieved by Kernel SVM.
For the primary diagnosis prognosis, the second-best results were 95.69% accuracy, 95.37% precision, 94.73% recall, and 95.04% f1-score achieved by the Linear SVM with LASSO feature selection and SMOTE sampling. The worst results were achieved by Kernel SVM.
For the tumor stage prognosis, the second-best results were 55.36% accuracy, 55.87% precision, 39.11% recall, and 39.07% f1-score achieved by the RF with LASSO feature selection and without SMOTE sampling. The worst results were achieved by the Linear SVM with PCA and SMOTE sampling.
For the vital status prognosis, the second-best results were 75.00% accuracy, 66.56% precision, 58.79% recall, and 59.33% f1-score achieved by the RF with LASSO feature selection and without SMOTE sampling. The worst results were achieved by the Linear SVM with PCA and SMOTE sampling.

4. Discussion and Conclusions

In this study, we showed that unsupervised non-linear DAE is an effective model to extract meaningful deep features of gene expression data from patients with kidney cancer. These features were significantly associated with the kidney cancer prognosis such as sample type, primary diagnosis, tumor stage, and vital status representing the state of patients. We also showed that the end-to-end hybrid deep learning architecture is more effective than the traditional machine learning analysis flow: feature extraction, sampling, classification.
We compared the proposed COST-HDL approach with other traditional approaches, and it achieved better results for all prognoses on gene expression data. The deep features extracted by the DAE model were distinguished better than the features extracted by the PCA method and the features selected by the LASSO method on both the training and testing sets. Further, another class label was identified by the DAE method. The results obtained can be applied to extract deep features from gene biomarkers for prognosis prediction of kidney cancer from various causes and, hence, are useful for the prevention and early diagnosis of kidney cancer.
This study can be improved in three ways. The first is to develop the unsupervised deep symmetric autoencoder further, for example by stacking more layers or by using denoising or variational variants. The second is to modify the loss function so that it handles the imbalance problem, the reconstruction error, and the classification error together. The third is to improve the classifier instead of using only the neural network, by adding more layers or replacing existing ones with other methods such as random forest, support vector machine, or k-nearest neighbor. Although the experimental results show that the proposed hybrid approach has the potential to improve the prognosis prediction of kidney cancer, the identification of significant biomarkers and the interpretability of the deep learning model are limited in our research. In the healthcare field, interpretability is one of the primary problems with deep learning, which is known as the black-box problem. The proposed approach can be extended by addressing the interpretability and human-readability of deep learning models. We will explore these ideas in future analysis.

Author Contributions

Conceptualization: H.S.S. and E.B.; methodology: H.S.S. and E.B.; software: E.B.; validation: H.S.S. and K.-A.K.; formal analysis: H.S.S.; investigation: H.S.S.; resources: H.S.S.; writing (original draft preparation): H.S.S. and E.B.; writing (review and editing): E.J.C. and K.O.K.; supervision: K.-A.K.; project administration: E.J.C. and K.O.K.; funding acquisition: K.-A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

This research was supported and funded by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT and Future Planning [NRF-2017R1D1A1B03030157] and [NRF2019R1F1A1051569].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xiong, H.Y.; Alipanahi, B.; Lee, L.J.; Bretschneider, H.; Merico, D.; Yuen, R.K.C.; Hua, Y.; Gueroussov, S.; Najafabadi, H.S.; Hughes, T.R.; et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 2015, 347, 1254806.
  2. Korean National Cancer Center. Available online: https://www.ncc.re.kr (accessed on 23 November 2019).
  3. Câmara, N.O.S.; Iseki, K.; Kramer, H.; Liu, Z.H.; Sharma, K. Kidney disease and obesity: Epidemiology, mechanisms and treatment. Nat. Rev. Nephrol. 2017, 13, 181–190.
  4. D’Angelo, G.; Pilla, R.; Tascini, C.; Rampone, S. A proposal for distinguishing between bacterial and viral meningitis using genetic programming and decision trees. Soft Comput. 2019, 23, 11775–11791.
  5. Bejnordi, B.E.; Veta, M.; Van Diest, P.J.; Van Ginneken, B.; Karssemeijer, N.; Litjens, G.; Van Der Laak, J.A.W.M.; Hermsen, M.; Manson, Q.F.; Balkenhol, M.; et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 2017, 318, 2199–2210.
  6. Amgad, M.; Elfandy, H.; Hussein, H.; Atteya, L.A.; Elsebaie, M.A.; Abo Elnasr, L.S.; Sakr, R.A.; Salem, H.S.; Ismail, A.F.; Saad, A.M.; et al. Structured crowdsourcing enables convolutional segmentation of histology images. Bioinformatics 2019, 35, 3461–3467.
  7. Kim, B.J.; Kim, S.H. Prediction of inherited genomic susceptibility to 20 common cancer types by a supervised machine-learning method. Proc. Natl. Acad. Sci. USA 2018, 115, 1322–1327.
  8. Chen, Y.; Li, Y.; Narayan, R.; Subramanian, A.; Xie, X. Gene expression inference with deep learning. Bioinformatics 2016, 32, 1832–1839.
  9. Ferroni, P.; Zanzotto, F.M.; Riondino, S.; Scarpato, N.; Guadagni, F.; Roselli, M. Breast cancer prognosis using a machine learning approach. Cancers 2019, 11, 328.
  10. Chen, M.; Hao, Y.; Hwang, K.; Wang, L.; Wang, L. Disease prediction by machine learning over big data from healthcare communities. IEEE Access 2017, 5, 8869–8879.
  11. Muhamed Ali, A.; Zhuang, H.; Ibrahim, A.; Rehman, O.; Huang, M.; Wu, A. A machine learning approach for the classification of kidney cancer subtypes using miRNA genome data. Appl. Sci. 2018, 8, 2422.
  12. Aljouie, A.; Patel, N.; Roshan, U. Cross-validation and cross-study validation of kidney cancer with machine learning and whole exome sequences from the National Cancer Institute. In Proceedings of the 2018 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), St. Louis, MO, USA, 30 May–2 June 2018; pp. 1–6.
  13. Ing, N.; Huang, F.; Conley, A.; You, S.; Ma, Z.; Klimov, S.; Ohe, C.; Yuan, X.; Amin, M.B.; Figlin, R.; et al. A novel machine learning approach reveals latent vascular phenotypes predictive of renal cancer outcome. Sci. Rep. 2017, 7, 13190.
  14. Kourou, K.; Exarchos, T.P.; Exarchos, K.P.; Karamouzis, M.V.; Fotiadis, D.I. Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J. 2015, 13, 8–17.
  15. Wolf, F.A.; Angerer, P.; Theis, F.J. SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. 2018, 19, 15.
  16. Libbrecht, M.; Noble, W. Machine learning applications in genetics and genomics. Nat. Rev. Genet. 2015, 16, 321–332.
  17. Zeng, W.Z.D.; Glicksberg, B.S.; Li, Y.; Chen, B. Selecting precise reference normal tissue samples for cancer research using a deep learning approach. BMC Med. Genomics 2019, 12, 21.
  18. Danaee, P.; Ghaeini, R.; Hendrix, D.A. A deep learning approach for cancer detection and relevant gene identification. Pac. Symp. Biocomput. 2017, 2017, 219–229.
  19. Kim, B.H.; Yu, K.; Lee, P.C. Cancer classification of single-cell gene expression data by neural network. Bioinformatics 2019.
  20. Xie, R.; Wen, J.; Quitadamo, A.; Cheng, J.; Shi, X. A deep auto-encoder model for gene expression prediction. BMC Genomics 2017, 18, 845.
  21. Gupta, A.; Wang, H.; Ganapathiraju, M. Learning structure in gene expression data using deep architectures, with an application to gene clustering. In Proceedings of the 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington, DC, USA, 9–12 November 2015; pp. 1328–1335.
  22. Genomic Data Commons Data Portal. Available online: https://portal.gdc.cancer.gov (accessed on 23 November 2019).
  23. Wang, H.; Li, B.; Leng, C. Shrinkage tuning parameter selection with a diverging number of parameters. J. R. Stat. Soc. Ser. B Stat. Methodol. 2009, 71, 671–683.
  24. Mortazavi, A.; Williams, B.A.; McCue, K.; Schaeffer, L.; Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 2008, 5, 621.
  25. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
  26. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 21–24 June 2010; pp. 807–814.
  27. Grave, E.; Joulin, A.; Cissé, M.; Jégou, H. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1302–1310.
  28. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
  29. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  30. PyTorch. Available online: https://pytorch.org (accessed on 23 November 2019).
  31. Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52. [Google Scholar] [CrossRef]
  32. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  33. Saeys, Y.; Inza, I.; Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23, 2507–2517. [Google Scholar] [CrossRef]
  34. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  35. Maaten, L.V.D.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  36. Goldberger, J.; Hinton, G.E.; Roweis, S.T.; Salakhutdinov, R.R. Neighbourhood components analysis. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 5–8 December 2005; pp. 513–520. [Google Scholar]
  37. Tang, Y. Deep learning using linear support vector machines. arXiv 2013, arXiv:1306.0239. [Google Scholar]
  38. Scholkopf, B.; Smola, A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2001. [Google Scholar]
  39. Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2, 18–22. [Google Scholar]
  40. Dreiseitl, S.; Ohno-Machado, L. Logistic regression and artificial neural network classification models: A methodology review. J. Biomed. Inform. 2002, 35, 352–359. [Google Scholar] [CrossRef]
  41. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
Figure 1. Overview of COST-HDL approach. We used kidney cancer gene expression data from the TCGA portal. The Deep Auto Encoder (DAE) model is used to extract deep features from gene biomarkers as a lower-dimensional vector. The Neural Network (NN) is used to classify sample type, primary diagnosis, tumor stage, and vital status. We summed the reconstruction loss (DAE) and balanced classification loss (NN) in the cost function.
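The summed cost described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that the reconstruction loss is the mean squared error and that the balanced classification loss is the focal loss of Lin et al. [28], as the panel labels of Figures 6–9 suggest. The function names, the per-sample formulation, and the focusing parameter `gamma=2.0` are illustrative assumptions and are not taken from the authors' implementation.

```python
import math


def mse_loss(x, x_hat):
    """Mean squared reconstruction error between an input and the decoder output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    gamma is an assumed value; gamma = 0 reduces to plain cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def total_loss(x, x_hat, logits, target, gamma=2.0):
    """Sum of the reconstruction loss (DAE) and the classification loss (NN)."""
    return mse_loss(x, x_hat) + focal_loss(logits, target, gamma)
```

Because the factor (1 - p_t)^gamma shrinks toward zero for confidently correct predictions, easy majority-class samples contribute little to the classification term, which is the usual motivation for a focal loss on imbalanced data.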
Figure 2. Visualization of extracted features from gene biomarkers for sample type prognosis: (a) train data extracted by PCA, (b) test data extracted by PCA, (c) train data extracted by LASSO, (d) test data extracted by LASSO, (e) train data extracted by DAE, (f) test data extracted by DAE.
Figure 3. Visualization of extracted features from gene biomarkers for primary diagnosis prognosis: (a) train data extracted by PCA, (b) test data extracted by PCA, (c) train data extracted by LASSO, (d) test data extracted by LASSO, (e) train data extracted by DAE, (f) test data extracted by DAE.
Figure 4. Visualization of extracted features from gene biomarkers for tumor stage prognosis: (a) train data extracted by PCA, (b) test data extracted by PCA, (c) train data extracted by LASSO, (d) test data extracted by LASSO, (e) train data extracted by DAE, (f) test data extracted by DAE.
Figure 5. Visualization of extracted features from gene biomarkers for vital status prognosis: (a) train data extracted by PCA, (b) test data extracted by PCA, (c) train data extracted by LASSO, (d) test data extracted by LASSO, (e) train data extracted by DAE, (f) test data extracted by DAE.
Figure 6. Training loss for sample type prognosis: (a) MSE loss, (b) focal loss, (c) total loss. The x axis indicates the number of epochs, and the y axis indicates the loss.
Figure 7. Training loss for primary diagnosis prognosis: (a) MSE loss, (b) focal loss, (c) total loss. The x axis indicates the number of epochs, and the y axis indicates the loss.
Figure 8. Training loss for tumor stage prognosis: (a) MSE loss, (b) focal loss, (c) total loss. The x axis indicates the number of epochs, and the y axis indicates the loss.
Figure 9. Training loss for vital status prognosis: (a) MSE loss, (b) focal loss, (c) total loss. The x axis indicates the number of epochs, and the y axis indicates the loss.
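Table 1. Number of genes and samples per class in the dataset.

| Prognosis | # Genes | # Samples | Class Type | Total | Train | Test |
|---|---|---|---|---|---|---|
| Sample Type | 58,404 | 1149 | Primary Tumor | 1010 | 805 | 205 |
| | | | Solid Tissue Normal | 139 | 114 | 25 |
| Primary Diagnosis | 58,409 | 1157 | C64.9 | 836 | 679 | 157 |
| | | | C64.1 | 321 | 246 | 75 |
| Tumor Stage | 60,483 | 1118 | Stage-I | 528 | 424 | 104 |
| | | | Stage-II | 183 | 145 | 38 |
| | | | Stage-III | 261 | 204 | 57 |
| | | | Stage-IV | 146 | 121 | 25 |
| Vital Status | 58,412 | 1157 | Alive | 835 | 664 | 171 |
| | | | Dead | 322 | 261 | 61 |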
Table 2. Number of deep features extracted from gene biomarkers by the DAE model.

| Prognosis | # Features |
|---|---|
| Sample Type | 100 |
| Primary Diagnosis | 100 |
| Tumor Stage | 100 |
| Vital Status | 100 |
Table 3. Number of features extracted from gene biomarkers by the PCA method.

| Prognosis | # Features |
|---|---|
| Sample Type | 100 |
| Primary Diagnosis | 100 |
| Tumor Stage | 100 |
| Vital Status | 100 |
Table 4. Number of gene biomarkers selected by the LASSO method.

| Prognosis | # Gene Biomarkers |
|---|---|
| Sample Type | 22 |
| Primary Diagnosis | 77 |
| Tumor Stage | 263 |
| Vital Status | 139 |
Table 5. Effect of loss function of the COST-HDL approach. The best results are shown in bold.

| Prognosis | Loss | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|
| Sample Type | MSE | 89.13 | 44.57 | 50.00 | 47.13 |
| | Focal | 99.57 | 99.76 | 98.00 | 98.86 |
| | Total | 100.00 | 100.00 | 100.00 | 100.00 |
| Primary Diagnosis | MSE | 62.93 | 43.86 | 47.89 | 42.63 |
| | Focal | 96.55 | 97.13 | 95.01 | 95.97 |
| | Total | 96.98 | 97.43 | 95.68 | 96.49 |
| Tumor Stage | MSE | 12.05 | 7.92 | 26.32 | 7.31 |
| | Focal | 54.46 | 45.15 | 45.05 | 43.76 |
| | Total | 56.70 | 49.41 | 46.14 | 46.68 |
| Vital Status | MSE | 73.71 | 36.85 | 50.00 | 42.43 |
| | Focal | 76.29 | 69.00 | 67.05 | 67.83 |
| | Total | 76.72 | 69.78 | 68.92 | 69.32 |
Table 6. Evaluation of prediction models for sample type. The best results are shown in bold.

| Classifier | Feature | Sampling | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| KNN | PCA | No | 98.70 | 99.28 | 94.00 | 96.45 |
| | | Yes | 96.52 | 88.46 | 96.29 | 91.87 |
| | LASSO | No | 98.70 | 97.43 | 95.76 | 96.57 |
| | | Yes | 99.57 | 98.08 | 99.76 | 98.90 |
| Linear SVM | PCA | No | 97.39 | 90.32 | 98.54 | 93.90 |
| | | Yes | 97.83 | 91.67 | 98.78 | 94.84 |
| | LASSO | No | 99.13 | 96.30 | 99.51 | 97.83 |
| | | Yes | 98.70 | 94.64 | 99.27 | 96.80 |
| Kernel SVM | PCA | No | 89.13 | 44.57 | 50.00 | 47.13 |
| | | Yes | 89.13 | 44.57 | 50.00 | 47.13 |
| | LASSO | No | 89.13 | 44.57 | 50.00 | 47.13 |
| | | Yes | 89.13 | 44.57 | 50.00 | 47.13 |
| RF | PCA | No | 95.65 | 97.67 | 80.00 | 86.31 |
| | | Yes | 97.39 | 98.58 | 88.00 | 92.46 |
| | LASSO | No | 99.13 | 99.52 | 96.00 | 97.67 |
| | | Yes | 100.00 | 100.00 | 100.00 | 100.00 |
| NN | PCA | No | 98.26 | 95.51 | 95.51 | 95.51 |
| | | Yes | 98.26 | 95.51 | 95.51 | 95.51 |
| | LASSO | No | 99.13 | 96.30 | 99.51 | 97.83 |
| | | Yes | 99.57 | 98.08 | 99.76 | 98.90 |
| COST-HDL | | | 100.00 | 100.00 | 100.00 | 100.00 |
Table 7. Evaluation of prediction models for primary diagnosis. The best results are shown in bold.

| Classifier | Feature | Sampling | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| KNN | PCA | No | 87.07 | 87.01 | 82.79 | 84.40 |
| | | Yes | 84.91 | 82.82 | 82.59 | 82.70 |
| | LASSO | No | 88.79 | 90.21 | 84.06 | 86.24 |
| | | Yes | 89.66 | 90.35 | 85.74 | 87.52 |
| Linear SVM | PCA | No | 88.79 | 86.67 | 89.28 | 87.67 |
| | | Yes | 92.67 | 91.32 | 92.15 | 91.71 |
| | LASSO | No | 94.40 | 94.03 | 93.07 | 93.53 |
| | | Yes | 95.69 | 95.37 | 94.73 | 95.04 |
| Kernel SVM | PCA | No | 67.67 | 33.84 | 50.00 | 40.36 |
| | | Yes | 67.67 | 33.84 | 50.00 | 40.36 |
| | LASSO | No | 67.67 | 33.84 | 50.00 | 40.36 |
| | | Yes | 67.67 | 33.84 | 50.00 | 40.36 |
| RF | PCA | No | 90.52 | 93.85 | 85.33 | 88.13 |
| | | Yes | 94.83 | 96.45 | 92.00 | 93.81 |
| | LASSO | No | 92.24 | 94.24 | 88.35 | 90.56 |
| | | Yes | 94.40 | 94.75 | 92.38 | 93.43 |
| NN | PCA | No | 89.22 | 88.50 | 86.47 | 87.36 |
| | | Yes | 88.36 | 87.76 | 85.13 | 86.25 |
| | LASSO | No | 92.24 | 91.65 | 90.44 | 91.01 |
| | | Yes | 92.24 | 92.73 | 89.39 | 90.79 |
| COST-HDL | | | 96.98 | 97.43 | 95.68 | 96.49 |
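Table 8. Evaluation of prediction models for tumor stage. The best results are shown in bold.

| Classifier | Feature | Sampling | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| KNN | PCA | No | 47.77 | 38.39 | 33.62 | 32.66 |
| | | Yes | 41.07 | 33.60 | 33.14 | 32.91 |
| | LASSO | No | 45.09 | 32.25 | 30.07 | 28.96 |
| | | Yes | 40.18 | 34.27 | 35.24 | 34.13 |
| Linear SVM | PCA | No | 29.91 | 27.61 | 27.15 | 24.73 |
| | | Yes | 26.34 | 39.21 | 32.28 | 25.64 |
| | LASSO | No | 40.62 | 37.24 | 40.47 | 34.28 |
| | | Yes | 50.00 | 43.01 | 38.21 | 36.61 |
| Kernel SVM | PCA | No | 46.43 | 11.61 | 25.00 | 15.85 |
| | | Yes | 46.43 | 11.61 | 25.00 | 15.85 |
| | LASSO | No | 46.43 | 11.61 | 25.00 | 15.85 |
| | | Yes | 46.43 | 11.61 | 25.00 | 15.85 |
| RF | PCA | No | 51.34 | 51.20 | 33.43 | 32.12 |
| | | Yes | 54.46 | 48.20 | 44.30 | 44.77 |
| | LASSO | No | 55.36 | 55.87 | 39.11 | 39.07 |
| | | Yes | 53.12 | 45.43 | 45.56 | 44.47 |
| NN | PCA | No | 46.88 | 38.89 | 38.65 | 38.75 |
| | | Yes | 47.32 | 40.36 | 40.81 | 40.45 |
| | LASSO | No | 41.52 | 35.67 | 35.33 | 35.23 |
| | | Yes | 45.54 | 38.23 | 37.98 | 38.01 |
| COST-HDL | | | 56.70 | 49.41 | 46.14 | 46.68 |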
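Table 9. Evaluation of prediction models for vital status. The best results are shown in bold.

| Classifier | Feature | Sampling | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| KNN | PCA | No | 70.69 | 57.64 | 54.81 | 54.64 |
| | | Yes | 65.09 | 54.75 | 54.70 | 54.72 |
| | LASSO | No | 66.38 | 51.15 | 50.83 | 50.29 |
| | | Yes | 65.52 | 55.10 | 54.99 | 55.04 |
| Linear SVM | PCA | No | 64.66 | 57.54 | 58.62 | 57.71 |
| | | Yes | 58.19 | 52.62 | 53.18 | 52.05 |
| | LASSO | No | 73.71 | 63.48 | 57.38 | 57.60 |
| | | Yes | 72.84 | 62.39 | 58.38 | 58.94 |
| Kernel SVM | PCA | No | 73.71 | 36.85 | 50.00 | 42.43 |
| | | Yes | 73.71 | 36.85 | 50.00 | 42.43 |
| | LASSO | No | 73.71 | 36.85 | 50.00 | 42.43 |
| | | Yes | 73.71 | 36.85 | 50.00 | 42.43 |
| RF | PCA | No | 73.71 | 62.50 | 53.16 | 50.42 |
| | | Yes | 70.26 | 58.66 | 56.62 | 56.95 |
| | LASSO | No | 75.00 | 66.56 | 58.79 | 59.33 |
| | | Yes | 73.28 | 65.73 | 66.05 | 65.88 |
| NN | PCA | No | 62.07 | 53.38 | 53.71 | 53.41 |
| | | Yes | 58.62 | 53.68 | 54.53 | 53.04 |
| | LASSO | No | 61.21 | 54.86 | 55.76 | 54.69 |
| | | Yes | 58.19 | 54.23 | 55.29 | 53.31 |
| COST-HDL | | | 76.72 | 69.78 | 68.92 | 69.32 |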
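As a reading aid for Tables 5–9, the short sketch below computes macro-averaged metrics for a hypothetical classifier that assigns every test sample to the most frequent class, using the test counts from Table 1. The function name is illustrative, and macro averaging with an undefined precision counted as 0 is an assumption about how the scores were computed. Under that assumption, the output coincides with the Kernel SVM rows in Tables 6–9 and with the MSE-only rows for sample type and vital status in Table 5. That pattern is what one would expect from a model that has collapsed to the majority class, although the tables alone do not prove it.

```python
def majority_baseline_macro_metrics(test_counts):
    """Macro-averaged metrics (in %) for a classifier that always predicts
    the most frequent class. `test_counts` maps class name -> test samples."""
    total = sum(test_counts.values())
    majority = max(test_counts, key=test_counts.get)
    precisions, recalls, f1s = [], [], []
    for cls, n in test_counts.items():
        if cls == majority:
            precision = n / total  # every sample is predicted as this class
            recall = 1.0           # so all of its true members are recovered
        else:
            precision = 0.0        # never predicted; undefined precision taken as 0
            recall = 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(test_counts)
    accuracy = test_counts[majority] / total
    values = (accuracy, sum(precisions) / k, sum(recalls) / k, sum(f1s) / k)
    return tuple(round(100 * v, 2) for v in values)


# Test-split counts taken from Table 1.
sample_type = {"Primary Tumor": 205, "Solid Tissue Normal": 25}
primary_diagnosis = {"C64.9": 157, "C64.1": 75}
tumor_stage = {"Stage-I": 104, "Stage-II": 38, "Stage-III": 57, "Stage-IV": 25}
vital_status = {"Alive": 171, "Dead": 61}

print(majority_baseline_macro_metrics(sample_type))        # (89.13, 44.57, 50.0, 47.13)
print(majority_baseline_macro_metrics(primary_diagnosis))  # (67.67, 33.84, 50.0, 40.36)
print(majority_baseline_macro_metrics(tumor_stage))        # (46.43, 11.61, 25.0, 15.85)
print(majority_baseline_macro_metrics(vital_status))       # (73.71, 36.85, 50.0, 42.43)
```

The example also shows why accuracy alone can mislead on imbalanced data: the sample type baseline reaches 89.13% accuracy while its macro recall stays at 50%.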