1. Introduction
Cancer is a disease that starts with the abnormal behavior and division of some cells, which damages nearby cells and results in a lump or tumor that, in certain cases, may cause death [1]. Early discovery and proper treatment can reduce the chances of damage to other cells. The high mortality rate from cancer [2] is motivating researchers to develop new methods for early cancer detection and classification. However, early detection is very complicated, because cancer cells are disordered. RNA-Seq analysis is extremely helpful in this regard.
RNA-Seq is a new and popular technique that is used to detect new isoforms and transcripts, and that provides more normalized and less noisy data for prediction and classification purposes [3,4]. The most important function of transcriptome profiling is to determine the differentially expressed genes occurring in a body or to detect variations in genes at different levels [5]. RNA sequencing allows identification and quantification to be performed in one place [6]. RNA-Seq data are widely available from different databases, and are being used to classify diseases such as breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), kidney chromophobe, etc. [7]. However, analyses of RNA gene expression data are quite complex because of their high dimensionality and the existence of duplications in feature values [8]. Therefore, there is a need for automatic feature extraction, which may be addressed through machine learning (ML) and deep learning algorithms [9].
Machine learning is a branch of artificial intelligence that identifies associations among data by finding underlying patterns using past experience and learning [10]. ML is becoming indispensable in the age of mass data, given that it is becoming increasingly difficult for humans to find trends and patterns in data to predict future outcomes. Hence, machine learning is replacing humans in identifying underlying patterns in data and in making predictions that support proper decisions. ML extracts features itself with almost zero human intervention, and then uses these features to make predictions. ML is being implemented almost everywhere. Its typical applications are in natural language processing, forecasting, flight governance, and biology, where it is used to detect sequences of proteins and RNA [11,12].
There are certain limitations of ML-based algorithms in terms of selecting promising features from biomedical images for classification. However, these limitations are being overcome by deep learning (DL). Deep learning is an emerging field based upon advancements in ML. It is a technique that draws conclusions directly from raw data, without a separate intermediate feature extraction step. This is why it is also named “automated feature engineering” [13,14]. Deep learning is being actively used in many research areas, including bioinformatics, computational medicine, image and graphical information processing, etc. [7]. A convolutional neural network (CNN) is a DL model suited to large numbers of graphical images. Using weight sharing, subsampling, and local connectivity, a CNN extracts the most relevant features and reduces the complexity of the neural network [15].
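The three CNN techniques named above correspond to what is commonly called weight sharing, subsampling (pooling), and local connectivity. The following minimal NumPy sketch illustrates them on a toy input; it is not the architecture used in this study, and the 6 x 6 input, the 2 x 2 filter values, and the function names are hypothetical choices made only for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1).

    As in most deep learning libraries, this is a cross-correlation:
    the kernel is not flipped.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Local connectivity: each output value sees only a kh x kw patch.
            # Weight sharing: the same kernel is reused at every position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Subsample by keeping the maximum of each non-overlapping block."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    trimmed = feature_map[:h, :w]  # drop rows/columns that do not fill a block
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Hypothetical toy input and filter (illustrative values only).
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])

features = conv2d_valid(image, kernel)  # shape (5, 5)
pooled = max_pool2d(features)           # shape (2, 2)
```

In this sketch the convolution has only four trainable weights regardless of the image size, whereas a fully connected layer mapping the same 36 inputs to 25 outputs would need 900; this is the sense in which weight sharing reduces network complexity.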
Deep learning is being implemented in many disease identification processes, and is improving on the performance of machine learning in the field [16]. The multilayer perceptron (MLP) is a feed-forward neural network used in deep learning to identify and classify different types of tumors [17,18]. A previous study used deep learning in the form of stacked denoising autoencoders (SDAE) to transform high-dimensional noisy data into low-dimensional data for the classification of breast cancer [8]. Another study proposed and implemented a new approach named convolutional neural network for coexpression (CNNC), which performs gene relationship inference in a supervised setting [19,20,21].
Differential analysis is the most significant part of RNA-Seq analyses. Conventional differential analysis methods usually compare tumor samples with normal samples from the same tumor type. Such methods fail to differentiate between tumor types because they lack knowledge of other tumor types. To better understand the cause of various tumors, detailed analyses using RNA-Seq data are required [22]. To extract the most relevant features, most analyses try to identify differentially expressed genes. It is therefore necessary to build a method that includes knowledge of multiple tumor types in the analysis [16].
Although RNA-Seq data are beneficial for the detection of variations at the gene level, they are challenging to work with due to their spatial features. In the present study, eight DL approaches are implemented for cancer classification from gene expression data. We use RNA-Seq data of five tumors. The numeric RNA-Seq values of multiple tumors are converted to 2D images. The most relevant features are then extracted and selected from these images using DL, and the images are classified with eight DL models. The main objectives of our work are as follows:
To investigate the impact of a preprocessing step on the classification accuracy.
To examine the impact of feature engineering using DL on the classification output.
To investigate the performance of eight DL algorithms for the classification of multiple tumor types, and to make comparisons with other state-of-the-art methods.
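As an illustration of the conversion of numeric RNA-Seq values to 2D images mentioned above, the following minimal Python sketch maps one sample's expression vector to a square array. The exact procedure is not specified in this section, so the log2(x + 1) transform, the min-max scaling, the zero padding, the row-by-row gene ordering, and the name `expression_to_image` are all assumptions made for illustration, not the method of this study.

```python
import math
import numpy as np

def expression_to_image(values):
    """Map a 1D vector of non-negative expression values to a square 2D array.

    Assumed steps: log2(x + 1) transform, min-max scaling to [0, 1],
    zero padding up to the next perfect square, row-by-row reshape.
    """
    x = np.log2(np.asarray(values, dtype=float) + 1.0)
    span = x.max() - x.min()
    # Guard against a constant vector, where min-max scaling is undefined.
    x = (x - x.min()) / span if span > 0 else np.zeros_like(x)
    side = math.ceil(math.sqrt(x.size))
    padded = np.zeros(side * side)
    padded[:x.size] = x
    return padded.reshape(side, side)

# Hypothetical expression values for ten genes of one sample.
sample = [0, 1, 3, 7, 15, 31, 63, 127, 255, 1023]
img = expression_to_image(sample)  # 4 x 4 array; last six pixels are padding
```

Each pixel then corresponds to one gene, so the resulting array can be passed to an image classifier in the same way as a grayscale image.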
The rest of this paper is organized as follows. Section 2 describes related work. Section 3 describes the materials and methods. Section 4 presents our experimental results and discussion, while Section 5 concludes the paper.
2. Related Work
Sterling Ramroach et al. [23] used different machine learning algorithms to classify cancer. In their study, a dataset covering multiple cancer types was downloaded from an online data portal, COSMIC. The applied machine learning models were random forest (RF), gradient boosting machine (GBM), neural networks (NN), K nearest neighbor (KNN), and support vector machine (SVM). The authors performed multiple experiments for various cancer types and primary sites. Notably, RF achieved 100% accuracy in classification and was easy to tune compared to other algorithms.
Yawen Xiao et al. [24] proposed a new deep learning-based, multimodel ensemble approach that uses five machine learning algorithms, i.e., KNN, SVM, decision trees (DTs), RFs, and gradient boosting decision trees (GBDT). Their proposed strategy was applied to three types of cancer: lung adenocarcinoma (LUAD), stomach adenocarcinoma (STAD), and BRCA. Each classifier is first trained on the provided data to obtain predictions individually; these predictions are then combined by a multimodel ensemble using deep learning. This method provides more accurate cancer predictions than those generated by any individual classifier.
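The two-stage idea described above, in which base classifiers predict individually and a second model learns how to combine their predictions, can be sketched as follows. This is only an illustration under stated assumptions: the base-classifier outputs are hypothetical numbers rather than results from [24], and a small logistic-regression combiner trained by gradient descent stands in for the deep neural network used in that work.

```python
import numpy as np

def _with_bias(base_preds):
    """Append a constant column so the combiner can learn an intercept."""
    return np.hstack([base_preds, np.ones((base_preds.shape[0], 1))])

def train_meta_learner(base_preds, labels, lr=0.5, epochs=5000):
    """Fit a logistic-regression combiner on stacked base predictions.

    Stand-in for the deep learning ensemble stage (an assumption of this sketch).
    """
    X = _with_bias(base_preds)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # current combined probability
        w -= lr * X.T @ (p - labels) / len(labels)  # gradient step on log loss
    return w

def predict_meta(base_preds, w):
    """Combine base predictions into a final 0/1 class label."""
    p = 1.0 / (1.0 + np.exp(-_with_bias(base_preds) @ w))
    return (p >= 0.5).astype(int)

# Hypothetical class-1 probabilities from three base classifiers (columns)
# for six samples (rows); the third classifier is deliberately unreliable.
base_preds = np.array([
    [0.9, 0.8, 0.4],
    [0.8, 0.6, 0.7],
    [0.7, 0.9, 0.2],
    [0.2, 0.3, 0.6],
    [0.1, 0.4, 0.3],
    [0.3, 0.1, 0.8],
])
labels = np.array([1, 1, 1, 0, 0, 0])

w = train_meta_learner(base_preds, labels)
final = predict_meta(base_preds, w)
```

The learned weights indicate how much the combiner trusts each base classifier, which is what allows an ensemble of this kind to outperform its individual members. In practice the combiner would be fitted on held-out predictions rather than on the same samples used to train the base classifiers.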
Dincer Goksuluk et al. [25] presented a new range of classifiers based on voom, named “voomNSC”, “voomNBLDA”, and “voomPLDA”, as well as SVM classifiers, for the classification and evaluation of RNA-sequencing data on cervical and lung cancer, as well as aging datasets. VoomNSC combines the voom transformation with the NSC method to build more accurate and robust classifiers. VoomDLDA and voomDQDA are not sparse classifiers, which means that they use all of the features provided in the model. In contrast, voomNSC is a sparse classifier and uses only a subset of the features in the model. The results were compared with PLDA, NBLDA, and NSC, and it was found that voomNSC produced the best results.
Paul Ryvkin et al. [26] presented a novel numerical approach, CoRAL (classification of RNA by analysis of length). For this purpose, the authors took small RNA sequence datasets and sequenced them. Then, multiple preprocessing steps were performed, i.e., the dataset was passed to three trimmed adapter sequences, and a FASTQ file was generated. Reads were aligned by matching them against a reference file, and the results were stored in a SAM file. After this, the authors computed a mismatch rate on the reads, and again, the results were added to a SAM file. After these steps, the aligned reads were converted to a BAM file to be presented to CoRAL. CoRAL extracts important features and classifies multiple types of RNA sequences. This method not only classifies small RNA sequences, but also provides better guidance to the user.
Nour Eldeen M. Khalifa et al. [27] proposed a novel optimized deep learning approach based on binary particle swarm optimization-decision tree (BPSO-DT) and CNN. Their study classified different types of cancer, i.e., kidney renal clear cell carcinoma (KIRC), BRCA, lung squamous cell carcinoma (LUSC), lung adenocarcinoma (LUAD), and uterine corpus endometrial carcinoma (UCEC). The approach comprised three phases. The first was related to feature extraction, and BPSO was used to extract relevant features. The second, called the augmentation phase, aimed to solve the problem of overfitting in order to get accurate results. The third and last phase was the deep CNN phase, which used a CNN architecture of connected layers to classify types of cancer based on the given data. This methodology produced more accurate results than the CNN technique.
Hamid Reza Hassanzadeh et al. [28] put forward a new pipeline approach for predicting the survival chances of cancer patients. The proposed technique used graph-based semisupervised learning with Laplacian support vector machines. This approach was used to predict the survival of kidney cancer (KIRC) and neuroblastoma (NB) patients. It comprised four steps. The first is preprocessing, in which data are analyzed and stored in feature matrices. The second step involves feature extraction, in which overfitting problems are removed. In the third, different models are trained. The final step is the adoption of a generalization strategy that checks each model and weights it according to its accuracy. This pipeline approach was compared to supervised SVM and produced more accurate results.
Jiande Wu et al. [29] proposed the use of different machine learning algorithms to distinguish triple-negative breast cancer from non-triple-negative breast cancer. For this purpose, RNA-sequencing gene expression data were downloaded from TCGA for 110 triple-negative breast cancer samples and 992 non-triple-negative samples. The applied machine learning classification models were SVM, KNN, Naïve Bayes (NB), and DT. Because of the high dimensionality of the data, an extra feature selection step was performed before classification to obtain the most relevant features. The accuracies of the classification task were 90%, 87%, 85%, and 87%, respectively. It is clear from the results that SVM performed better than the other approaches.
Léon-Charles Tranchevent et al. [30] proposed a new graph-based approach for feature selection, combined with deep neural networks (DNN), to predict the clinical outcomes of neuroblastoma patients. This approach took patient data and applied the graph-based method to extract the most relevant features. The extracted features were then used to train the DNN model. Finally, the performance of the model was recorded. Its accuracy was compared with that of other classifiers, namely, support vector machine and random forest, trained on the same data. The proposed methodology outperformed these classifiers in predicting patient clinical outcomes.
Joseph M. de Guia et al. [16] proposed a deep learning model using CNN. The methodology was applied to RNA-Seq data for the complex problem of classifying different types of cancer. The proposed CNN comprised an input layer, whose input nodes with their specific weights were fully connected to three hidden layers, and output layers connected to the intermediate hidden layers. This methodology provided better results compared to existing classification models like GA/kNN, BaselineCNN, random forest, and support vector machine.
Adam McDermaid et al. [31] proposed a machine learning-based tool named GeneQC (gene expression quality control) to accurately estimate the reliability of expression levels from RNA sequencing datasets. The authors used 95 RNA sequencing datasets from a total of seven plant and animal species. GeneQC takes three types of information as input: a SAM file of mapped reads, a reference genome FASTA file, and a species-specific annotation file. GeneQC implements two processes, i.e., feature extraction through Perl and a mathematical representation of the extracted features in an R package. Lastly, GeneQC classifies the category of read alignment of every single genome.
Yawen Xiao et al. [2] presented a stacked sparse autoencoder using a semisupervised deep learning approach. This strategy was used to predict different types of cancers, i.e., LUAD, STAD, and BRCA. The model comprised semisupervised feature extraction techniques and supervised classification techniques to handle both labeled and unlabeled data, in order to extract more precise information for cancer predictions. The proposed methodology was compared with other state-of-the-art machine learning classifiers like SVM, RF, NN, and autoencoders, and was shown to provide more accurate prediction results. Other research has discussed the application of technologies such as the Internet of Things (IoT), networks, software-defined networking (SDN), and wireless sensor networks (WSN) [32,33,34,35,36,37,38,39].
Boyu Lyu et al. [22] proposed an approach that converts RNA-Seq data into 2D images, which are then classified by a CNN. This technique was applied to the classification of 32 types of tumors. The workflow first preprocessed the gene expression data and converted them to 2D images, and then sent the images to a CNN, which served as the classification model. In the third step, heat maps were developed for each class, and genes, which correspond to pixels, were selected where they had high salience in a heat map. In the final step, the pathways of the selected genes were validated. For testing and comparison purposes, SVM and RF were used; the proposed model was shown to provide better accuracy.
Brian Aevermann et al. [40] proposed the use of feature selection and binary expression scoring with a random forest to identify biomarkers in high-throughput sequencing. For this, the authors introduced NS-Forest version 2.0 in their study. This latest version of NS-Forest is suitable for two tasks, i.e., downstream analysis and the identification of active cell types. In their study, gene expression data for cells with cluster assignments were presented to the random forest, where important features were extracted through a Gini index. Genes were further ranked to overcome negative markers. Then, a binary expression score was used to identify top-ranked genes. To determine the minimum number of features, a threshold based upon a decision tree was used, together with an F-beta score to examine possible combinations of biomarkers. To examine the performance of the method, experiments were conducted on human middle temporal gyrus (MTG) data.
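For reference, the two standard quantities mentioned above can be computed as in the short Python sketch below. These are the textbook definitions of Gini impurity and the F-beta score, not code taken from NS-Forest; the function names are hypothetical, and the value of beta used in [40] is not stated here.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta > 1 weights recall more.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom

# A marker combination with perfect precision but only 50% recall.
precision_weighted = f_beta(1.0, 0.5, beta=0.5)
recall_weighted = f_beta(1.0, 0.5, beta=2.0)
```

With these inputs the precision-weighted score is higher than the recall-weighted one, which shows how the choice of beta changes which marker combinations are preferred.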
Padideh Danaee et al. [8] presented an approach based on deep learning to diagnose cancer and identify important genes for the detection of breast cancer. For this, a stacked denoising autoencoder (SDAE) was used for feature extraction from a breast cancer data set. To validate the results, three classification algorithms were applied, namely, ANN, SVM with a linear kernel, and SVM with a radial basis function kernel. Autoencoders are basically feed-forward neural networks that use hidden layers to produce an output that closely reconstructs the input. Moreover, SDAE performs dimensionality reduction stack by stack on RNA-Seq data. For a performance evaluation, the authors compared their approach with principal component analysis (PCA) and kernel principal component analysis (KPCA), and noted that SDAE outperformed both.
Yang Guo et al. [41] proposed a new deep learning approach named boosting cascade deep forest (BCDForest) as an alternative to deep neural networks for the classification of cancer subtypes. This methodology was implemented on three microarray data sets containing adenocarcinoma, brain, and colon cancer, as well as on RNA-Seq data sets including BRCA, GBM, Pancancers, and LUNG. The methodology works as an ensemble of deep forests, in which each forest contributes to predicting the classification results. The cascade forest attempts to identify meaningful features in raw data by training and assembling decision tree-based random forests. The output was then compared with state-of-the-art classifiers, including SVM, KNN, LR, RF, and the original gcForest. The authors noted that their proposed method provided more accurate results.
Table 1 provides a concise overview of the literature discussed above.