DeepSGP: Deep Learning for Gene Selection and Survival Group Prediction in Glioblastoma

: Glioblastoma Multiforme (GBM) is an aggressive form of glioma, exhibiting very poor survival. Genomic input, in the form of RNA sequencing data (RNA-seq), is expected to provide vital information about the characteristics of the genes that affect the Overall Survival (OS) of patients. This could have a signiﬁcant impact on treatment planning. We present a new Autoencoder (AE)-based strategy for the prediction of survival (low or high) of GBM patients, using the RNA-seq data of 129 GBM samples from The Cancer Genome Atlas (TCGA). This is a novel interdisciplinary approach to integrating genomics with deep learning towards survival prediction. First, the Differentially Expressed Genes (DEGs) were selected using EdgeR. These were further reduced using correlation-based analysis. This was followed by the application of ranking with different feature subset selection and feature extraction algorithms, including the AE. In each case, ﬁfty features were selected/extracted, for subsequent prediction with different classiﬁers. An exhaustive study for survival group prediction, using eight different classiﬁers with the accuracy and Area Under the Curve (AUC), established the superiority of the AE-based feature extraction method, called DeepSGP. It produced a very high accuracy (0.83) and AUC (0.90). Of the eight classiﬁers, using the extracted features by DeepSGP, the MLP was the best at Overall Survival (OS) prediction with an accuracy of 0.89 and an AUC of 0.97. The biological signiﬁcance of the genes extracted by the AE were also analyzed to establish their importance. Finally, the statistical signiﬁcance of the predicted output of the DeepSGP algorithm was established using the concordance index.


Introduction
Cancer is a disease of genetic instability, often associated with genes directly involved in cell growth and proliferation, differentiation, survival, and apoptosis, or indirectly involved through genes participating in cell signal transduction pathways. Research over the past decade has revealed that many of the defective genes affect the same set of molecular pathways [1]; defects in only a small number of such pathways are actually responsible for many different tumor types [2]. Therefore, therapeutics directed at specific molecular pathways (involving cancer-altered genes) can be used to effectively treat different types of tumors. Current developments in genomics have enabled molecular profiling of biological specimens by simultaneously revealing the expression levels of thousands of genes. The gene expression patterns of cancer tissues can reveal its etiology, prognosis, and response to therapy and facilitate the individualized selection of therapies. Here, dimensionality reduction and feature selection help in filtering out and focusing on only those sets of differentially expressed genes, from the diseased tissue, that can be best correlated with patient prognosis and clinical outcome [3]. This helps reduce the computational complexity of the problem and make the appropriate modeling feasible.
Gliomas constitute 70% of malignant primary brain tumors in adults [4] and are usually classified as High-Grade Gliomas (HGGs) and Low-Grade Gliomas (LGGs)-with the former being more aggressive and infiltrative. Glioblastoma Multiforme (GBM) is an aggressive form of HGG. It is one of the most common and invasive brain tumors, with a median survival of 12 to 15 months. Standard treatments are surgery followed by chemotherapy or radiotherapy [5]. Surgical removal of tumor can increase survival substantially [6]. It was shown that post surgical chemotherapy with radiation followed by adjuvant temozolomide can improve survival further [5]. However, some idea about the overall survival is definitely useful for effective treatment planning. Although standard treatment protocols help enhance the survival of GBM patients, the prediction of their survival time and the identification of (possible) causative genes can increase the efficiency of these models. Genomic features (RNA-seq), obtained invasively, can be used for survival prediction in GBM patients.
Deep Learning (DL) is a kind of representation learning, where a machine learns from raw data to automatically discover the representations needed for regression or classification [7]. It involves multiple levels (depth) of representation, such that each transforms the representation at the preceding level into one at a higher, slightly more abstract level. Some of the inherent limitations of deep learning include high computational cost and the requirement of a large number of training data. Among the different DL models, the AE [8] is used to learn efficient data encoding through representation learning and is typically employed for dimensionality reduction.
The problem of predicting OS in brain cancer patients is challenging. It depends on multiple factors such as cancer grade, molecular biomarkers, non-biological factors (tumor location, whether it was surgically removed or reduced), age, sex, and other comorbidities. Besides, the presence of censored data causes further problems. The application of deep learning models for OS prediction is limited also due to the absence of large annotated publicly available datasets [9]. This is because training a deep model from scratch generally requires a large amount of training data to generalize, while the data remain typically scarce in medical applications with expert annotation being expensive.
The prediction of the OS of patients, from multi-modal MRI using deep learning, has started to attract attention [10]. However, it has been observed that the majority of the literature on survival prediction is based on radiomics [11][12][13][14][15][16][17][18]. Reference [19] reported a very low accuracy (51.5%) on the test set even by the best performing model. In more recent times, clinical (age) and radiomic (volumetric, shape, and texture) features from multimodal MRI scans were used [20] for two-class (short and long) and three-class (short, medium, and long) survival group prediction. It may be noted that the clinical feature (age), very obviously, contributed the most towards the high accuracy reported for survival prediction. Moreover, survival analysis, employing regression on genomic data, has also gained attention, with most of the literature attempting to find a gene signature related to the survival of a patient [21][22][23][24][25][26][27][28][29]. A study to discover genes with prognostic potential for GBM patients was reported [9], using an MLP with two hidden layers. The selection of 27 significant genes was reported from the TCGA-GBM gene expression data employing survival analysis.
In this research, OS prediction was transformed into a classification problem by dividing the patients into two survival groups, viz. high and low. It was based on the determination of a threshold; by selecting an equal number of samples on either side of the median survival point [30]. This was a novel interdisciplinary approach integrating deep learning with genomics towards survival prediction.
The contributions and output of this article are as follows: • Development of a new AE-based feature extraction model (DeepSGP) for survival group prediction in glioblastoma; • A novel interdisciplinary application to survivability prediction, in terms of genes, using deep learning; • An exhaustive experimental study, including comparison with state-of-the-art (related) feature extraction and feature selection algorithms; • Quantitative evaluation to establish the statistically significant performance of the model; • Analysis of the biological significance of genes extracted by the AE.
The rest of the paper is organized as follows. The list of symbols is given in Table 1. Preprocessing is reported in Section 2. Section 3 describes the methods for the selection of the genes. Section 4 provides the details on dimensionality reduction. This is followed by the experimental results in Section 5, encompassing the comparative study with the related methods and analysis of the biological significance of the extracted genes. Finally, Section 6 concludes the article.

Preprocessing
Here, we present the dataset preparation for the simulation study, based on TCGA-GBM RNA sequencing data (https://www.cancer.gov/tcga (accessed on 8 November 2018). The Cancer Genome Atlas Program provided 174 samples containing RNA sequencing (RNA-seq) data, with the number of control and disease samples being five and onehundred sixty-nine, respectively. The control data being the normal samples, these were not found suitable for this study; hence, they were discarded. After eliminating the 40 censored data, the threshold between the high and low survival groups was selected based on the median [30] computed at 384 days on the 129 samples. Note that the sample median can be treated as a good representation of the population statistics [30]. Now, there were 64 samples (with high survival) and 65 samples (having low survival) of glioblastoma cases, to formulate the two-class framework. The raw counts of RNA-seq for these 129 samples were employed in our experiments. Here, the RNA-seq data for each sample consisted of the raw read counts of 60,483 genes, which included both protein-coding and noncoding genes.
The EdgeR package [31] (in R) was used to prepare the data. It is a Bioconductor software package that detects differentially expressed genes from normalized read count data. The raw count of the RNA sequence, with a cutoff of 200 per million, was employed for low abundance data cleaning, to estimate the effective library size of the entire data towards normalization. Genes with low expression values are basically noise and hinder the evaluation of differentially expressed genes. Next, the RNA-seq dataset was normalized by adjusting the library sizes to minimize the log-fold changes. A major step in the analysis of Differentially Expressed Genes (DEGs) is to estimate their dispersion parameters. EdgeR uses an overdispersed Poisson model to analyze the biological and statistical variances, with the empirical Bayes method being employed to normalize the degree of overdispersion across transcripts.
Differential expression analysis [32] was used to select groups of DEGs having significant quantitative changes in expression levels between the experimental groups (involving normalized read count data and statistical analysis) reporting high and low survival in glioblastoma samples. With a threshold of p < 0.002 [31] (indicating the significance of genes), a total of 877 DEGs were extracted by the above procedure. These 877 genes were considered as our dataset for subsequent feature (gene) selection or extraction.
A volcano plot [33] was generated using the differential expression analysis results obtained from EdgeR. Volcano plots are a useful means of visualizing the data in RNA-seq, proteomics, and other high-throughput experiments. It usually resembles a scatter plot where the statistical significance denoted by the p-values is plotted with the fold changes of the expression of the genes. From this plot, one can easily perceive changes in large datasets and thereby identify the best candidate genes relevant to the study. We used the Bioconductor package "EnhancedVolcano" to construct the volcano plot. Likewise, in the case of differential expression analysis, we took a p-value threshold of less than 0.002 to denote significance. Additionally, we used a log fold-change cut-off of one, signifying at least a two-fold change in the expression between the two conditions. The volcano plot is provided in Figure 1. Looking at the volcano plot, we can see that the genes marked in red are the ones that were statistically significant on account of both the p-values and their fold changes. On the other hand, the genes represented in blue depict the ones that were significant in terms of their p-values (≤0.002), measured as the magnitude of their fold changes (which were less than two folds). In other words, the genes depicted in "red" were deregulated by at least a two-fold change in expression between the two conditions (high-vs. low-GBM). In general, these two groups of genes were relevant in our study for further downstream analyses.
As some samples in the TCGA-GBM RNA sequencing dataset had a very high expression value for a particular gene, as compared to that of the majority samples in the same class, this was likely to be noise. Therefore, we did not consider log 2 (FC), where FC (Fold Change) is the ratio of the expression values of a particular gene between the two classes, as a deciding criterion during DEG selection. Low abundance cleaning based on expression count rate and a p-value less than 0.002 were the deciding criteria.
Due to high correlation among the genes of RNA-seq data, it became necessary to reduce the redundancy amongst features. Correlation was applied for dimensionality reduction. This is a statistical method that evaluates how strongly a pair of features are related. The correlation coefficient uses the standard deviation (σ) and covariance (cov) to measure the relationship between features f 1 and f 2 as: Correlation-based feature selection helps reduce the feature set while preserving the data characteristics. It helps select a subset of features that ensures minimum feature-tofeature correlation and redundancy. We selected only one gene from each correlated cluster, with the correlation coefficient > 0.7, based on the (minimum) p-value. Figure 1. Volcano plot depicting the significant genes in terms of their significance levels and fold changes in expression. Genes with p-values less than 0.002 were considered to be statistically significant ("red" and "blue" points in the plot). Of these, those with more than a two-fold expression are additionally demarcated (genes denoted in "red"). The top 5 upregulated and downregulated genes with the highest changes in expression are labeled. The genes represented in black (NS) and green (logFC) points denote the ones that were not statistically significant and were not considered for downstream analyses.
As the sample size of the dataset was very small (only 129) and the number of features was very high, employing deep learning became infeasible. Therefore, data augmentation was performed to enhance the sample size (from 129 to 804). Over-sampling was employed for this purpose [34], using the SMOTE algorithm [35]. The Synthetic Minority Oversampling TEchnique (SMOTE) improves random oversampling by generating additional samples based on the feature space characteristics of a dataset.

Selection of Genes
The selection of features can be supervised, as well as unsupervised, depending on class information availability in the data. The algorithms are typically categorized as filter and wrapper models [36], with a differing emphasis on dimensionality reduction or accuracy enhancement. The wrapper methods assess feature subsets according to their usefulness to a given predictor. However, selecting a good set of features is usually suboptimal for building a predictor, particularly in the presence of redundant variables. Since finding the best feature subset is intractable or NP-hard [37], heuristic and nondeterministic strategies are deemed more practical. It is important to select a highly informative and non-redundant feature (gene) subset for the efficiency and accuracy of decision making.
Feature selection can be performed through ranking, as well as subset selection strategies [38]. The former computes the ranking score for each gene according to its discriminative power, followed by a simple choice of the top k genes. Although preferable for high-dimensional problems, due to its computational scalability, such subsets of selected genes remain prone to redundancy. This is because the ranking score is computed independently for each gene, while completely ignoring its correlation with others. On the contrary, the feature subset selection methods focus on choosing a subset of genes that jointly possess higher discriminative power. In general, sophisticated subset selection methods demonstrate improved decision making. However, the high computational cost usually limits their applicability to high-dimensional domains. Since a combinatorial search of the optimal combination of features is computationally prohibitive, a judicious combination of both strategies becomes desirable.
In this study, EdgeR was used to find the significantly deregulated genes between the two groups under study. Prior to carrying out the statistical testing, the low abundance genes were filtered out to improve the power of the statistical testing [39] (as described in Section 2). This preprocessing was followed by a sequential combination of ranking and subset selection in order to select the relevant genes. The entire process is outlined in Figure 2.

Ranking
The simple two-tailed t-test [40], which measures the significance of the difference of the means between two distributions, was used to evaluate the discriminatory power of each individual gene in separating the two survival classes. If the features are assumed to come from normal distributions with unknown, but equal variances, the t-statistic is defined as: where µ In the absence of any consideration of the correlation between features, the redundant ones are inevitably selected, thereby affecting the classification performance. Therefore, we used this strategy to initially select the more discriminatory genes by applying a cut-off ratio p < α, where α = 0.05 is the significance level for rejecting the null hypothesis. Here, the p-value indicates the probability of the test statistic, calculated using the null distribution.

Subset Selection
In the second stage, subset selection was employed to evaluate the effectiveness of the discriminative genes selected by the t-statistic, as outlined above. Let J be the criterion function for feature selection, which measures the importance or usefulness of a feature (gene) subset with respect to classification. It is also known as the "relevance index" in the literature. Let F k be the feature that is being evaluated, C be the class label, and S be the set of features that are already selected.
Let I(X; Y) denote the mutual information between two random variables X and Y, where p (x) and p (xy) are the probability and joint probability of occurrence, respectively. We have:

1.
Mutual Information Maximization (MIM) [41]: Here, each feature is considered independently and ranked according to the score. On the basis of the rank, the required number of features is selected; 2.
Joint Mutual Information (JMI) [42]: The objective is to select a complementary subset F k of the relevant features F j ; 3.
Minimum Redundancy Maximum Relevance (MRMR) [43]: focuses on selecting feature pairs from S having the maximum relevance and minimum redundancy, with |S| denoting the cardinality of the set; 4.
Double Input Symmetrical Relevance (DISR) [44] is a modification of the criterion of Equation (4) and given by: Here, H(F k F j C) is the joint entropy;

5.
Mutual Information Feature Selection (MIFS) [45], as Equation (5), also attempts to balance the "relevance-redundancy" tradeoff in the feature. We computed: with β being configured experimentally. Note that β = 0 leads to J mim (F k ), and β = 1 |S| results in J mrmr (F k ). Since it has been experimentally observed [45] that β = 1 is often optimal, hence this was used in our study; 6.
Finally, a Consensus (CONSS) of all seven subsets, generated by Equations (3)- (9), was also considered to incorporate the aggregated effect of the algorithms. Those genes occurring in at least three of the seven selected subsets were considered, of which the top 50 genes were taken for the present study.

Dimensionality Reduction
Although feature selection can choose individual features that are important for classification, it often neglects the interfeature relationship. An individual feature may monothetically fail to represent a dataset properly, yet polythetically, its inter-relationship with other features can function better. Feature extraction encompasses projection to a new dimension for generating transformed attributes in a reduced space without losing significant information. Given a gene dataset, it is therefore possible that this relation between genes holds the key to the accurate prediction of survival in a patient. Here lies the utility of autoencoders in dimensionality reduction.

Statistical Methods
Some of the well-known, traditional feature extraction strategies include Principal Component Analysis (PCA) [48], Kernel PCA (KPCA) [49], and Independent Component Analysis (ICA) [50]. PCA extracts the dominant information from the patterns by identifying eigenvectors with the largest eigenvalues as the most relevant attributes. KPCA, on the other hand, is a nonlinear formulation of PCA. Using integral operator kernel functions, it extracts the principal components of a dataset. A nonlinear map is employed for this purpose. ICA separates a multivariate signal from some individual non-Gaussian components, such that the generated attributes remain statistically independent. It serves as an example of blind source separation.

Autoencoder
Although statistical feature selection algorithms can efficiently detect singleton features, they typically ignore the combined effect of multiple features. This intergene relationship is not properly explored in the high-dimensional genomic data. In this scenario, the latent layer of an AE allows representing a combined effect of the entire input feature set. AEs are global nonlinear dimensionality reduction methods. They capture the topology of the manifold and tend to capture more of these global characteristics than other traditional dimensionality reduction algorithms. They can thereby learn very complicated nonlinear transformations of the features that interact in a nonlinear way, as observed in the case of high-dimensional genomic data.
The autoencoder [51] is a classical unsupervised learning model used here for gene data analysis. The basic AE is depicted in Figure 3 and includes an encoder and a decoder. The encoder compresses high-dimensional data into a low-dimensional feature space. This is an effective way of dimensionality reduction [52,53], particularly for redundant gene data. The decoder is then responsible for restoring the compressed features back to their input space. Let X = [x 1 , x 2 , . . . , x m ] ∈ R n×m be the input, containing m samples with n features. The function of the encoder can be expressed as E : X ∈ R n×m −→ Y ∈ R p×m where p n and Y is the output of the hidden layer. The function of the decoder is D : Y ∈ R p×m −→X ∈ R n×m , whereX is the output of the AE. Here, E and D represent the encoder and decoder transformations, respectively. The Mean-Squared Error (MSE) optimization function of the AE (MSE AE ) can be represented as: The AE was employed to reduce the feature space in order to efficiently perform subsequent decision making in the form of survival prediction.

Experimental Results
In this section, we present the results for survival group prediction, using eight classification algorithms on the reduced feature space (involving the selection and projection of genes). Preprocessing (as in Section 2) on the RNA-seq data with EdgeR over 60,483 genes, considering 129 uncensored samples, led to the extraction of 877 DEGs. The application of correlation-based selection, with a coefficient corr = 0.7, resulted in the choice of a reduced set of 315 genes.

Selection of Genes
The 315 genes thus obtained using Equation (2) were considered for feature ranking. A subset of 50 genes was selected by each of the seven subset selection algorithms of Equations (3)-(9), from the 315 gene set. An overall summary of the seven subsets is presented in Table 2. It is clearly evident that the maximum p-value by all the algorithms was ≤0.0019, thereby establishing the statistical significance of the selections. This was followed by considering a consensus among these seven subsets of genes, such that those present in at least three of the seven subsets were selected. Of these, the first 50 genes were chosen. The results are depicted in the last row of Table 2, as CONSS. Subsets of gene IDs, selected before and after the application of the seven filters, are reported in Supplementary Materials Sections S1 (Table S1) and S2, respectively.

Extraction of Reduced Attributes
Next, the AE model DeepSGP was employed for attribute extraction. The 315 genes' set was encoded to 50 transformed attributes, selected from the latent (middle) layer of the AE consisting of two hidden layers with 200 and 100 nodes, respectively ( Figure 3). The decoded output layer can reconstruct these back to their original gene space. Multiple the AE architecture was used for this experiment (hidden layers with 200, 150; 150, 100; and 150, 75 nodes, were employed). It was experimentally observed that stepwise decrementing (and incrementing) of the number of nodes, over the consecutive hidden layers (from 200 through 100 to 50), provided improved output, as compared to that of a sudden drop. The latent layer constituted the transformed attributes. The 50 projected attributes (chosen for the parity of comparison with the other methods) formed the reduced set for further investigation.
For comparison with other feature extraction procedures (explained in Section 4.1), we also extracted the most relevant 50 attributes from the 315 genes using the PCA (variance between 50.84 and 2.03), KPCA, and ICA algorithms. Finally, eight subsets of selected features (using feature selection and consensus algorithms), one subset of encoded attributes (DeepSGP), and three sets of extracted attributes were used by the eight classifiers considered in our study. Note that in each case, the 50 attributes (or features) were selected/extracted, based on their performance, for subsequent prediction with different classifiers. By increasing or decreasing the features, an improvement in performance was not evident. Moreover, taking 50 features (or attributes) in all the cases was reasonable, particularly for maintaining parity.

Classification into Survival Groups
The capability of feature selection and dimensionality reduction (including AE), using a 50-dimensional reduced input space, was next evaluated for survival class prediction of glioblastoma patients into the low and high survival classes. Eight classifiers, viz. Logistic Regression [54] (LR), Support Vector Machine [55] (SVM, linear kernel), k Nearest Neighbors [56] (kNN, averaged over k = 3, 5, and 7), Naive Bayes [57] (NB), Multilayer Perceptron [58] (MLP, ReLU activation), Classification and Regression Tree [59] (CART with entropy), Random Forest (RF, with n_estimators = 100) [60], and XGBoost (XGB, with booster gbtree) [61] were used for this purpose. The hyperparameters were selected based on several experiments. The classifiers were tested on the seven subsets of selected genes and one consensus subset of genes along with four subsets of extracted attributes. All the experiments were implemented using Python 3.6.
The results are depicted in Tables 3 and 4, evaluated for the average Accuracy Acc and AUC, respectively. As the sample size of the dataset was very small, we did not use 10-fold cross-validation. Instead, the Leave-One-Out strategy (LOO) was used for the experiments. The metric Area Under the Curve (AUC) measured the distinguishability between the two survival classes (low and high). The algorithms considered included CIEF, DISR, ICAP, JMI, MIFS, MIM, MRMR, and Consensus (CONSS), for selection, and ICA, PCA, KPCA, and AE, for extraction. The average score of each algorithm, over all eight classifiers, is indicated in the last column of both tables. It was evident that the average accuracy of 0.83 and the AUC 0.90 were attained by the AE, as it outperformed the rest of the algorithms by a large margin. The next highest value (on average) was obtained by ICAP, with an accuracy of 0.71 and an AUC of 0.81, which was significantly lower than the proposed DeepSGP. The ICA performed the worst, with an average accuracy of only 0.61 and an average AUC of 0.71. From the perspective of a classifier, the best results were obtained using the MLP; with a 0.89 accuracy and a AUC of 0.97. The F1-score (the harmonic mean of the precision and recall) is defined as F1 = 2 × (precision × recall)/(precision + recall), where precision = True positives/(True positives + False positives) and recall = True positive/(True positive + False negative), speci f icity = (True negative/(True negative + False positive), and sensitivity = True positive/(True positive + False negative). These are depicted in Table 5. The Receiver Operating Characteristic (ROC) curve is shown in Figure 4. From Table 5 and Figure 4, it was evident that (in all cases), DeepSGP outperformed the other methods. The experiments were also performed with DeepSGP involving fewer nodes. For example, with 10, 20, 30, and 40 nodes used for testing, the average accuracy and AUC were always less than 0.80 and 0.88, respectively. It may be noted that increasing the size of the subset did not ensure better results. Subsequent analysis of the significance of the genes, decoded from the attributes selected by the AE (DeepSGP), is pursued in Section 5.4. The predictive ability of the extracted attributes from DeepSGP was estimated using the concordance index, which was found to be a reasonable 0.81. The proposed method was compared with state-of-the-art feature extraction/selection algorithms and demonstrated superiority in all cases.  Table 5. Comparative study of the F1-score, sensitivity, and specificity in the survival prediction by DeepSGP using 50 attributes. All the experiments were conducted on a computer with an Intel Xeon E5 processor using 128 GB of RAM and an NVIDIA TITAN Xp GPU. The programming language R and Python 3.6 were used for the experiments [62,63].

Biological Significance of Extracted Genes
The weights of the AE were analyzed to infer the importance of its encoded genes towards decision making. Since our model had two hidden layers, preceding the latent layer, it became difficult to directly decode the genes based on their connectivity from the latent layer. Therefore, an analogous single-layer AE was designed with 50 nodes in its latent layer. Although this smaller AE also generated a high prediction accuracy of 0.84, it was less than that by the DeepSGP version represented in Figure 3. The normalized final weights, with values above a threshold of 0.5 or below a threshold of −0.5, were considered to be the important genes. In this manner, twenty-one genes could be extracted. Their biological relevance could also be validated from the existing literature. The results are listed in Table 6, along with the corresponding references, reporting their relevance.

Comparisons with Other Feature Sets
Analogously, employing thresholds of ±0.4 and ±0.3 resulted in the extraction of 52 and 100 significant genes, respectively. The results for the accuracy and AUC, with the eight classifiers on all three sets of selected genes, viz. 21, 52, and 100, are provided in Table 7. It can be observed that the accuracy and AUC with 315 genes were not better than those obtained using the AE. This highlights the importance of the AE towards extracting significant genes for survival group prediction. It may be emphasized that while Tables 3 and 4 provide prediction evaluation in terms of 50 projected attributes (for DeepSGP with a 200 × 100 × 50 × 100 × 200 configuration), Table 7 presents the results obtained by genes extracted by analyzing the AE weight values (with a 200 × 50 × 200 configuration). It may be concluded that a judicious combination of features (genes) through DeepSGP led to improved survival group prediction.

Pathway Analysis
The shortlisted set of 100 genes was used to perform pathway analysis using the KEGG [92] and Reactome [93] pathway databases. The IDs of the subset of the shortlisted 100 genes for pathway analysis are listed in Supplementary Materials Section S3 ( Table S2). The list of identified pathways is given in Supplementary Materials Section S4 (Tables S3 and S4). We obtained only two significant pathways in the case of KEGG, while eight significant processes were obtained upon performing the analysis on the Reactome Database. Of these processes, the noteworthy ones were the activation of matrix metalloproteinases and collagen formation and degradation [94], associated with glioblastoma.

Gene Ontology Analysis
Gene Ontology analysis was carried out on these top 100 differentially expressed genes, extracted from the AE ( Table 7). The Panther (http://www.pantherdb.org/(accessed on 2 February 2020)) and GOrilla (http://cbl-gorilla.cs.technion.ac.il/(accessed on 2 February 2020)) gene ontology databases were used to perform the enrichment analysis. Fisher's exact test with FDR correction yielded no significantly enriched biological processes (i.e., FDR ≤ 0.05). On that account, we withheld any statistical analysis and looked into the biological significance of these genes independently.
The selected list revealed several genes related to the prognosis and pathogenesis of glioblastoma cells. Among these, Heparin-Binding Epidermal Growth Factor-like growth factor (HBEGF) has been reported to be expressed in several cases of malignant gliomas, and it has been shown to induce glioblastoma in mice [95]. Various subtypes of glioblastoma have been shown to be characterized by the increased expression of genes related to the extracellular matrix and invasion [96]. Accordingly, our set included cadherins PCDHA7 and PCDHB15 to be among the deregulated genes. Another extracellular matrix protein, COL22A1, has been shown to be involved in head-and-neck cancers. Spainhour and Qiu [75] identified a possible association between COL22A1 expression and survival in glioblastoma and low-grade glioma patients. Several matrix metalloproteases were found to be deregulated, including MMP7 and MMP9, which have been reported to be associated with glioblastoma. A low level of expression of MMP9 was found to be correlated with better survival rates and is considered to be a prognostic biomarker in patients with primary glioblastoma [97]. Mu et al. [98] demonstrated that PRL3 promotes invasion and proliferation in glioma cells, and this process is mediated by the increased expression of MMP7 through the ERK and JNK signaling pathways. Another family of proteins, namely A Disintegrin And Metalloproteinases (ADAMs), has been found to be involved in the progression and pathogenicity of cancers. Of these, ADAMTS14 was found to be significantly deregulated in our dataset [99]. B3GLCT, a glucosyltransferase, has been found to be highly overexpressed in gliomas, among other categories of malignancies, as seen in the TCGA dataset (https://www.proteinatlas.org/ENSG00000187676-B3GLCT/pathology) (accessed on 2 February 2020).
Bioinformatic analysis integrating both gene expression and methylation profiles in glioblastoma using TCGA RNA-seq and methylation data, followed by further validation in other TCGA gene expression and TCGA methylation datasets, revealed several genes (including RPP25) to be among the eight-gene signature that divided the GBM patients into high-risk and low-risk groups, with the high-risk individuals being characterized by lower survival rates [84]. The lncRNA FER1L4 has been established to be upregulated in gliomas, with the expression levels being significantly higher in high-grade gliomas as compared to gliomas of a lower grade [89]. Another differentially expressed lncRNA, LINC02731, has been found to have significantly elevated expression in brain tissues as compared to other tissues (https://www.ncbi.nlm.nih.gov/gene/100128239#gene-expression (accessed on 2 February 2020)); however, whether or not it has any role in glioblastoma is yet to be examined.
Using the Panther Database to examine the pathways regulated by these genes, we found that several genes were involved in the cholecystokinin receptor-mediated signaling pathway (CCKR map in GO terms). These included HBEGF, IL8, and matrix metalloproteinases (MMP- 3,7,9). The matrix metalloproteases were also found to be involved in other pathways such as the plasminogen activating cascade, which is known to be associated with glioma cell invasion [100] and the Alzheimer disease presenilin pathway. Other regulated pathways, with corresponding genes that were differentially expressed in our set, included the gonadotropin-releasing hormone receptor pathway (FOSB, MAP2K3, and AL035425), the cadherin signaling pathway (PCDHA7 and PCDHB15), the integrin signaling pathway (AL035425, MAP2K3), and the EGF receptor signaling pathway (HBEGF and MAP2K3).

Conclusions
We developed an AE-based feature extraction model, DeepSGP, for survival group prediction, producing very high-quality results. It can be inferred that the RNA-seq data had a good potential to predict the survival prognostic of GBM patients, with the appropriate selection of genes. The extraction by the AE was compared with that from several state-of-the-art feature selection and extraction algorithms. A set of eight classifiers was used for learning and predicting, based on the selected and extracted reduced attributes' set. DeepSGP outperformed all the methods on average, and was particularly good with respect to MLP. The biological significance of the genes, extracted by the AE, were also analyzed to establish their importance. The concordance index on the AE-extracted features demonstrated that the results were quite reasonable. In our study, feature selection was employed as a means of extracting the highest correlated genes. Though this method will potentially identify the most relevant genes, several genes specific to disease biology will be omitted by this strategy. Moreover, the results are yet to be implemented in clinical care. One of the major reasons is a lack of a multicentric approach in validating the prediction of the model before the implementation of these observations [101]. In the future, we will try to discover the correlation between the selected genes with the imaging phenotypes (extracted from multi-modal MR images) for a better understanding of brain cancer patients.