Promising Prognosis Marker Candidates on the Status of Epithelial–Mesenchymal Transition and Glioma Stem Cells in Glioblastoma

Multivariable analyses of global expression profiling are valid indicators of the prognosis of various diseases including brain cancers. To identify the candidates for markers of prognosis of glioblastoma, we performed multivariable analyses on the status of epithelial (EPI)–mesenchymal (MES) transition (EMT), glioma (GLI) stem cells (GSCs), molecular target therapy (MTT), and potential glioma biomarkers (PGBs) using the expression data and clinical information from patients. Random forest survival and Cox proportional hazards regression analyses indicated significant variable values for DSG3, CLDN1, CDH11, FN1, HDAC3/7, PTEN, L1CAM, OLIG2, TIMP4, IGFBP2, and GFAP. The analyses also comprised prognosis prediction formulae that could distinguish between the survival curves of the glioblastoma patients. In addition to the genes mentioned above, HDAC1, FLT1, EGFR, MGMT, PGF, STAT3, SIRT1, and GADD45A constituted complex genetic interaction networks. The calculated status scores obtained by principal component analysis indicated that GLI genes covered the status of EPI, GSC, and MTT-related genes. Moreover, survival tree analyses indicated that MEShigh, MEShighGLIlow, GSChighGLIlow, MEShighMTTlow, and PGBhigh showed poor prognoses and MESmiddle, GSClow, and PGBlow showed good prognoses, suggesting that enhanced EMT and GSC are associated with poor survival and that lower expression of EPI markers and the pre-stages of EMT are relatively less malignant in glioblastoma. These results demonstrate that the assessment of EMT and GSC enables the prediction of the prognosis of glioblastoma that would help develop novel therapeutics and de novo marker candidates for the prognoses of glioblastoma.


Introduction
The World Health Organization (WHO) classifies gliomas into four categories based on malignancy and overall survival (OS) [1]. Glioblastoma is the most malignant form of astrocytoma that is fast-growing (grade 4 glioma) [1,2]. The median OS of glioblastoma is 9-15 months and the five-year survival rate remains less than 5% [1,[3][4][5]. Radiotherapy with six cycles of concomitant temozolomide, an oral alkylating agent with minimal additional toxicity, is the standard treatment after surgery in glioblastoma patients [2,4,5]. Thus, there is an immediate requirement for the early diagnosis and precise prediction of the prognosis for treatments of glioblastoma.
The invasiveness of glioblastoma depends on its high infiltration potential to invade the basement membranes of surrounding tissues [6][7][8][9]. Glioma cells are reprogrammed to have increased motility via weakened cell adhesions and a dysregulated cytoskeleton; this is known as epithelial-mesenchymal

Random Survival Forests Analysis
Random survival forests analysis was performed to determine the variable importance factors distinguishing gene expression associated with patient survival using the randomForestSRC package in R [24][25][26]. The variable importance values reflected the relative contribution of each variable to the prediction for the survival time, which was estimated by randomly permuting the values and recalculating the predictive accuracy of the model.

Cox Hazards Regression Analysis
Correlations between gene expression and survival times were evaluated by the Cox proportional hazards regression analysis using the JMP built-in module (SAS Institute Inc.) [25,27].

Survival Analysis
The Kaplan-Meier analysis was performed to estimate the survival distributions of subgroups with the log-rank test using the JMP built-in module (SAS Institute Inc.). The hazard ratio (HR) and confidence interval (CI) were calculated with a logistic regression model according to patients' survival times, which were assessed with a stepwise selection to compare the subgroups. Overall survival (OS) was defined as the date of diagnosis of glioblastoma to the date of death or last follow-up [19,20].

Graphical Lasso Estimation
Genetic interactions with hub networks among variables from gene expression were analyzed by the graphical lasso estimation of Gaussian graphical models, such as a sparse inverse covariance matrix using a lasso (L1) penalty, using the glasso package in R [28,29].

Principal Component Analysis (PCA)
Score correlation analysis was performed with formulae constituted by the first principal components derived from PCA. PCA was performed with the normalized FPKM values using the JMP built-in module (SAS Institute Inc.), which was used to classify the patients into the subgroups to estimate their prognoses in simple forms as linear combinations of expression values of the genes of interest [26].

Survival Tree Analysis
Tree-structured survivals were analyzed to identify the largest differences in the Kaplan-Meier survival curves by dividing the patients into the most appropriate subgroups, harboring variable spaces as overall survival and interval censoring according to a constant model of the response variable in each partition [30]. The survival tree analysis was performed using the rpart package in R [31].

Status of EMT and GSC Gene Expression in Glioblastoma
The aim of this study is to identify promising prognosis marker candidates of glioblastoma. As glioma is caused by EMT and maintained by glioma stem cells, in this study, we used the expression data and clinical information from 286 glioblastoma patients from TCGA and CGGA (Table S1) and focused on representative genes involved in EMT and glioma ( Figure 1, Table S2) [34][35][36][37]. Expression of the 126 genes in 151 samples from the TCGA data set was shown with hierarchical clusters represented by heatmaps, including those in EPI (14 genes; Figure 1a Figure 1f). However, distinct expression patterns capable of stratifying the patients into clusters were not identified, suggesting a possibility that differential expression of the genes requires combinations to evaluate the prognoses of glioblastoma patients. Thus, we tested whether combinations of the genes were associated with EMT and GSC in glioblastoma.

Evaluation of the Formulae for the Prediction of Prognosis of Glioblastoma
To determine patient survival, formulae were composed for 22 genes (Table S3) in a simple linear form using the sum of the integration of the coefficients and fragments per kilobase of exon per million mapped fragments (FPKM) of the normalized genes (Appendix A). Glioblastoma patients were divided into two subgroups based on the median status scores using the formulae. Subsequently, Kaplan-Meier survival curves were plotted ( Figure 2). All of the low score subgroups showed good prognoses such as EPI low (HR 0.51, 95% CI 0.32-0.81, p = 0.0039; Figure 2a Figure 2f). The CGGA data set also showed similar results, although significances were not detected ( Figure S3). These results indicate that the prognosis prediction formulae are useful to evaluate survival in glioblastoma patients. These formulae were associated with strong HRs for the GLI markers (HR 3.70), MTT genes (HR 3.22), and PGBs (HR 3.44), suggesting the success of the formulae. Gene ontology (GO) analysis (Table S4) indicated the gene sets that were associated with biological processes including proliferation (GO 0008284; p = 3.74 × 10 −7 ), metabolic process (GO 0051246; p = 8.02 × 10 −6 ), cell differentiation (GO 0030154; p = 4.81 × 10 −5 ), migration (GO 0030334; p = 7.58 × 10 −5 ), and histone H3 deacetylation (GO 0070932; p = 1.02 × 10 −4 ). These results suggest that the genes used for constructing the prognosis prediction formulae are involved in cancer development and progression.

Genetic Interaction and Network Hubs within the Genes of Interest
We examined the genetic interactions based on the expression intensities to detect the network hubs and central factors in glioblastoma ( Figure 3). The rates of composition of interacting to entry genes were relatively high (69.2%-95.8%). The genes constituting the formulae were also included in the genetic network: CLDN1 in EPI markers ( Figure (Figure 3f). Of these, a part of the networks for the biomarkers of MES, GSCs, and PGBs was also replicated in the CGGA data set ( Figure S4b,d,f). These results also suggest that the gene selection in this study was useful for predicting the prognosis of glioblastoma.

Correlation between EMT, GSCs, MTT, and PGBs
Correlations between the six processes were examined using principal component analysis. Each process score was calculated by the sum of the index of the integration of the first principal components and the normalized FPKM values (Appendix B). The correlation between processes was introduced into the scatter plot matrix with correlation coefficient (r) and statistical significances (p < 0.001; Figure 4a). Almost all the correlation scores were relatively high (r > 0.61; Figure 4a). The graphical lasso (Figure 4b) detected strong positive correlation between GLI-MTT (r = 0.88) and EPI-MES (r = 0.71), and weak positive correlation between other cellular processes like MES-PGB (r = 0.38), PGB-GSC (r = 0.30), and GSC-GLI (r = 0.25). However, in the CGGA validation data set, these correlation networks were not replicated, such as the negative correlation between EPI and the other processes ( Figure S5a) and more complex networks drawn in Figure S5b. We validated the positive correlation between GLI-MTT (r = 0.87), GSC-PGB (r = 0.43), and MES-PGB (r = 0.43), as shown in Figure S5b. These findings suggest that considering MES and GSC as malignant conditions has little meaning in predicting the prognosis of glioblastoma. However, combining GLI with markers of GSCs, MTT genes, and PGBs may enable the development of novel therapeutics and de novo candidate gene markers for predicting the prognosis of glioblastoma.

Assessment using a Combination of EMT and GSCs in the Progression of Glioma
We tested some combinations of EMT and GSC relative to the OS of glioblastoma patients using survival tree analysis ( Figure 5). Using the MES and GLI markers, MES middle and MES low GLI high showed good prognoses, whereas MES high GLI low showed a poor prognosis (Figure 5a), suggesting that the expression of mesenchymal genes could be a good indicator for a prognosis of glioblastoma. MES middle showed a stronger prognosis than MES high (Figure 5b). Using the GLI and GSC markers, GSC low showed a good prognosis, whereas GSC high GLI low showed a poor prognosis as compared with GSC high GLI high (Figure 5c), suggesting that GSC markers indicate malignant glioma cells. Moreover, PGB high and PGB middle showed poor prognoses as compared with PGB low (Figure 5d,e), thereby validating the candidate markers and formulae used in this study. MES high MTT low showed a weaker prognosis than MES middle and MES high MTT high (Figure 5f). This suggests that reduced expression of the target genes for MTT leads to poor patient survival. In contrast, the CGGA validation data set returned that EPI low led to a better prognosis than EPI middle and EPI high ( Figure S6a,b), suggesting a possibility that the overexpression of EPI genes can distinguish between the initiation of tumorigenesis and normal epithelial differentiation. Important data about GSC markers, MES markers, and PGBs could not be obtained from the CGGA data set ( Figure S6c,d). Summarizing these results, the expression of EMT markers including MES, EPI, GSC, and PGBs could indicate a poor prognosis and malignant glioblastoma. Combinations of MES, GLI, and MTT gene markers might be useful for determining the prognosis of glioblastoma.

Multiple Assessments Required for Diagnosis and Predicting Prognosis of Glioblastoma
A combined formula was constituted based on the Cox proportional hazards regression analysis and our previous results (Appendix C). The formula distinguished the Kaplan-Meier survival curves of the subgroups divided by the median scores with a significant HR of 3.08 (95% CI 2.04-4.70, p < 0.0001; Figure 6a) and the validation data set with similar results, but were not statistically significant (Figure 6b). To determine the important factors in the prognosis prediction formula based on representative genes like ESI, MES, GLI/GSC markers, and PGBs, we split the formulae used previously into several equations minus various factors (Figure 6c). Interestingly, the formulae without the GSC (∆GSC) and MES markers (∆MES) showed the opposite results in the risk ratio compared with the normal combined formula (COMMON) in both the TCGA and CGGA data sets (Figure 6d,e), suggesting that evaluating MES and GSC markers is indispensable for the combined formula. The formulae missing EPI (∆EPI, ∆EPI∆GSC, and ∆EPI∆PGB) and GLI markers (∆GLI, ∆GLI∆GSC, and ∆GLI∆PGB) also weakened and/or reversed the risk ratios as compared with COMMON in both data sets (Figure 6d,e). Thus, the appropriate combination of several factors chosen based on the progression of glioma guarantees predicting the prognosis of glioblastoma.

EMT and CSCs in Cancer Development and Progression
Cancer progression involves the induction of genetic stress and EMT, followed by CSC self-renewal, proliferation, and infiltration [38][39][40]. This process is also enhanced by the tumor microenvironment during angiogenesis and evasion of immune response and tumor cell properties, such as proliferation, migration, invasion, and drug resistance [38,40,41]. In glioma, chemo-and radiation-resistant GSCs have properties such as self-renewal and survival that contribute to tumor vasculature by transdifferentiating into endothelial cells upon induction with vascular endothelial growth factors (VEGFs) and hepatoma-derived growth factors (HDGFs) [38]. EMT is associated with infiltration as well as the generation and maintenance of CSCs [39]. EMT also contributes to tumor invasion, heterogeneity, chemoresistance, and robustness in reprogrammed gene expression that enables morphological and functional dedifferentiation into CSCs from differentiated tumor cells [39]. Dysregulation of splicing factors and tumor-specific isoforms frequently occurs in human tumors, indicating the importance of post-transcriptional regulation, including alternative splicing and/or gene fusion [38,42,43]. Thus, it is considered that EMT and stem cell differentiation, including dedifferentiation into CSCs, are closely related to cancer development and progression. Therefore, cellular processes and intrinsic alterations such as alternative splicing would provide the base for developing novel chemical therapeutics and de novo therapies against cancers [27].

Status Assessment and Prognosis Prediction using Multivariable Analyses on Gene Expression Profiling
Increased expression of PD-L1 correlates with a poor outcome of glioblastoma [44]. Profiling of PD-L1 expression is useful for determining patient survival time in glioblastoma [19,45]. A lower expression of type 2 T helper (Th2) cell genes compared with Th1 cell genes shows a good prognosis associated with lower expression of PD-L1 and PD-1 in glioblastoma, although estimating prognosis based on PD-L1 and PD-1 expression is difficult [19]. In the Th2 low subgroup of patients with glioblastoma, a reduced expression of solute carrier family 11 (proton-coupled divalent metal ion transporter), member 1 (SLC11A1; encoding natural resistance-associated macrophage protein), tumor necrosis factor (TNF) receptor superfamily member 1B (TNFRSF1B), and lymphotoxin β receptor (LTBR) also show good prognoses [19,46]. TMZ is commonly used for treating glioblastoma and its mechanism of action is well known; however, the side effects associated are largely unknown [47,48]. The TMZ-stimulated glioblastoma cells suppress the expression of pro-inflammatory cytokines in activated periphery blood mononuclear cells. This depends on the enhanced expression of PD-L1, but no other immune checkpoint genes, such as B7-H3, herpesvirus entry mediator (HVEM), and galectin-9 (LGALS9), suggesting that TMZ controls the expression of PD-L1 in glioblastoma cells, resulting in the evasion of the immune system [41]. An increased expression of B7-H3 and the gene signature comprising GATA3 and galectin-3 (LGALS3) show poor prognoses of glioblastoma [20], suggesting the involvement of the lectin family and oligosaccharide-mediated cell-cell and cell-matrix interactions on the antigen-presenting cell surface in their efficiency to understand glioblastoma progression as well as brain lymphoma [49,50]. Prediction of prognosis based on analyzing cellular processes is applicable to various tumors including other brain tumors, such as primary central nervous system lymphoma (PCNSL). The PD-L1 protein is expressed in tumor microenvironments in 52% of cases, such as immune cells and macrophages. Patients with PCNSL (4.1%) also exhibit a rare and aggressive form of extra-nodal non-Hodgkin's lymphoma [51,52]. This indicates that PD-L1 expression is detected in the surrounding and tumor cells. Thus, the assessment of cellular processes and prediction of prognosis with statistics from expression data are possible. Differential expression of PD-1 and PD-L2 and the Th1/Th2 balance enables the prediction of prognoses of PCNSL [27]. The formula comprising miR-30d, miR-93, and miR-181b is a promising marker for the prognosis of PCNSL [53]. The approaches and methods used in such studies could be applied to various diseases including brain tumors in assessing EMT and CSCs in this study. In addition, the candidates for diagnostic markers may also be detected from comparing with normal brain data.

Conclusions
In this study, we demonstrated that selective gene marker sets comprising 22 genes (Appendix A), such as DSG3, FN1, IGFBP2, CLDN1, HDAC7, and L1CAM as single gene marker candidates, are useful for predicting the prognosis of glioblastoma, based on the status assessment of EMT and GSCs that can be developed as novel target candidates for therapeutic intervention. The prognosis prediction formulae that included 12 genes (Appendix C), within a mixture of cellular processes, is effective in evaluating the survival of glioblastoma patients, which would help develop novel therapeutics and provide de novo marker candidates to predict the prognosis of glioblastoma.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A
The formulas estimating the status of EMT, glioma, glioma stem cells, molecular target therapy genes, and potential glioma biomarkers based on the Cox proportional hazard regression analysis.

Appendix B
The formulas estimating the status of EMT, glioma, glioma stem cells, molecular target therapy, and potential glioma biomarkers based on the principal component analysis.