Identication of a novel epithelial-mesenchymal transition-related gene signature for the prognosis prediction of endometrial carcinoma

Background Endometrial cancer (EC) is the most prevalent gynecological malignant tumor worldwide. The prognosis of EC patients with distant metastasis is poor. Epithelial ‐ mesenchymal transition (EMT) plays a crucial role in tumor invasiveness and metastasis. the expression prole of EMT-related gene and their prognostic value and therapy target potential have not been systematically explored in endometrial cancer. The RNA-Seq expression prole and corresponding clinical data of EC patients were downloaded from The Cancer Genome Atlas database and the Gene Expression Omnibus (GEO) databases. A 9 EMT-related genes (EPHB2, TUFT1, CDKN2A, ONECUT2, RBP2, KLF8, E2F1, SIX1, ERBB2) prognosis risk model of EC was built by combining a machine learning algorism and multivariate Cox regression. Estimation of STromal and Immune cells in MAlignant Tumor tissues using Expression data (ESTIMATE) and Cibersort was applied to acquire for tumor purity and immune cell inltration of high and low risk group. Also, response to immune checkpoint blockades (ICB) therapy and drug sensitivity were separately evaluated in TIDE and GDSC database for the EC patients. our study, we investigate the correlation between EMT related genes (ERGs) expression and clinical data in 543 endometrial cancer patients downloaded from The Cancer Genome Atlas (TCGA), and integrated the least absolute shrinkage and selection operator (LASSO) Cox regression model of the EMT-related genes signature. The patients with high EMT risk scores were strongly associated with lower overall survival. Then we validated this signature in GEO database. Afterwards, we established a nomogram to predict the prognosis by integrating clinical characteristics and the EMT gene signature. In the end, we conducted an integrated analysis of tumor immune microenvironment, immunotherapy response and drug sensitivity to high- and low-risk score group.


Introduction
Endometrial carcinoma (EC) is one of the most prevalent gynecological malignant tumors. The incidence of EC kept increasing 1.3% per year from 2007 to 2016 worldwide. According to the latest cancer statistics, it was estimated that there would be approximately 65620 new cases diagnosed and 12590 death in the United States in 2020 [1].Those patients who are diagnosed in early stage could receive appropriate treatment and achieve good prognosis, but still the prognosis of EC patients with distant metastasis was not so satisfying, with a 5-year survival rate lower than 20% [2]. It is essential to explore the underlying mechanisms of endometrial cancer metastasis and identify new biomarkers to predict the progression, prognosis, and response to treatment of EC.
Epithelial-mesenchymal transition (EMT) is a multistep process that epithelial cells develop a mesenchymal phenotype which characterized by the loss of polarity and barrier integrity, and therefor gain motility and invasive properties [3]. EMT plays a crucial role in tumor invasiveness and metastasis [4].
Accumulated evidence shows that EMT is a key process that drives cancer metastasis, enhances drug resistance, and relates to poor prognosis in multiple cancers, including endometrial cancer. A decreased level of E-cadherin was reported in endometrial cancer with an elevation of Snail and Slug nuclear expression, which signi cantly represented the EMT process and related to poor prognosis in endometrial cancer. However, the expression pro le of EMT-related genes as prognostic factor and potential as therapeutic target have not been systematically explored.
The EMT signaling pathways could be activated by several cytokines or growth factors from the focal microenvironment. The tumor cells that undergo EMT are also associated with immune exclusion and immune deviation in the tumor microenvironment (TME) [5]. Recent pan cancer analysis also revealed a strong correlation between EMT and immune activation [6]. The crosstalk between EMT and the immune cells may bring potential therapeutic opportunities. However, the impact of EMT on the tumor immune microenvironment of endometrial cancer is still unknown.
In our study, we investigate the correlation between EMT related genes (ERGs) expression and clinical data in 543 endometrial cancer patients downloaded from The Cancer Genome Atlas (TCGA), and integrated the least absolute shrinkage and selection operator (LASSO) Cox regression model of the EMTrelated genes signature. The patients with high EMT risk scores were strongly associated with lower overall survival. Then we validated this signature in GEO database. Afterwards, we established a nomogram to predict the prognosis by integrating clinical characteristics and the EMT gene signature. In the end, we conducted an integrated analysis of tumor immune microenvironment, immunotherapy response and drug sensitivity to high-and low-risk score group.

Data preparation
We acquired the EMT-related genes (ERGs) list from epithelial-mesenchymal transition gene database(http://dbemt.bioinfo-minzhao.org/dbemt2.txt), which contained 1184 epithelial-mesenchymal transition genes with experimentally veri ed information and pre-calculated regulation information for cancer metastasis [7]. High-throughput sequencing mRNA expression data of ERGs and corresponding clinical data of Endometrial carcinoma (UCEC) cohort were obtained from the TCGA. Somatic mutation data and copy number variation were also download from TCGA. Fragments Per Kilobase of transcript per Million method was applied to normalize the expression data. The gene expression dataset, GSE21882 based on GPL10422 SWEGENE H_v3.0.1 and GPL10423 SWEGENE H_v2.1.1 was downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE21882). There were 45 type1 endometrial carcinoma samples included in the dataset. The raw data was download from GEO database as a data set soft le.

Differentially expressed ERGs
The "limma" package was applied to identify the differentially expressed EMT-related genes (EGRs). ERGs in TCGA-UCEC mRNA expression cohort were identi ed with FDR<0.05 and |logFC| > 1. Also "ggplot", "ggrepel", and "heatmaps" were executed in R package to perform visualization. Enrichment analysis of intersection genes GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment analyses and visualization of intersection genes were performed by "clusterPro ler", "topGo", "Rgraphviz" and "pathview" R package with p-value < 0.05 as the cut-off value.

Individualized gene signature
As we acquired the ERGs and the clinical data of 539 patients in TCGA-UCEC, univariate COX regression was performed with "survival" R packages and 101 prognostic related ERGs in TCGA-UCEC were identi ed with a p-value<0.05. Next, we performed a machine learning algorithm, the LASSO algorithm with a 10,000-time cross-validation. In the end. 9 genes were identi ed as prognostic related ERGs. With these 9 prognostic related ERGs, multivariate Cox regression analysis was performed to generate an individual predictive formula. The risk score (RS)= ( stands for the regression coe cient and x stands for the expression of selected gene) based on the formula mentioned above were calculated in TCGA-UCEC cohort and veri ed in GSE21882 dataset. Based on the RS, the patient in the two datasets were divided into low risk-group and high-risk group. And overall survival, disease free survival, clinical-pathological characteristics were compared between the two groups.

Nomogram construction
Univariate analysis has identi ed that age at diagnosis, FIGO stage, differentiation grade and risk score were signi cantly associate with overall survival. The cut-off p value was set as 0.05. All the signi cant prognostic variate mentioned above were enrolled in the multivariate Cox analysis. The p-value was set as 0.1 in the multivariate Cox analysis and age at diagnosis, differentiation grade, FIGO stage, risk score was included in the nomogram construction. C-index was applied to evaluate the predictive power of the constructed nomogram for the overall survival[8]. Larger C-index was regarding to a better predictive power of the nomogram. Calibration curve was applied to determine the compliance between the predictive survival probability and the actual survival proportion after bias correction for the 3-year and 5year overall survival.
Tumor immune microenvironment analysis ESTIMATE (Estimation of STromal and Immune cells in MAlignant Tumor tissues using Expression data) is one of the algorithms developed to evaluate the cell tumor composition by calculating the immune and stromal scores using Pearson's correlation coe cient. By using "estimate" R package, the immune and stromal scores were calculated based on the gene expression data of EC patients in TCGA-UCEC. The in ltration levels of 22 immune cells in each TCGA-UCEC patients were calculated through R packages "CIBERSORT" algorithm based on the gene expression levels [9].
Immunotherapy and chemotherapy response prediction CTLA-4 and PD-1/PD-L1 pathway have been proven to be crucial in immune evasion, and thereby immune check point blockades therapy targeting PD-1 or CTLA-4 were developed to improve certain patients` prognosis [10]. Tumor Immune Dysfunction and Exclusion (TIDE) algorithm were utilized to predict the clinical response to immune checkpoint blockades therapy [11]. As chemotherapy is necessary in advanced EC patients and substantial chemotherapy were still debatable after the failure of initial chemotherapy, we acquired information from the Genomics of Drug Sensitivity in Cancer (GDSC) database [12]. Clinical response to different component were predicted through R packages "pRRophetic" and the half-maximal inhibitory concentration (IC50) of the samples were predicted by ridge regression [13].

Result
Identi cation of Differentially Expressed ERGs RNA-seq and clinical follow-up data from TCGA-UCEC datasets was downloaded, which contains 543 cancer samples and 35 normal samples. 1184 EMT-related genes (ERGs) are acquired from Human EMT Database. Then, the transcriptional expression of these 1184 ERGs in TCGA-UCEC RNA-seq data was extracted and compared between normal samples and UCEC samples through "limma" package in R software (|logFC| > 1, FDR < 0.05). The results indicated that there are 236 genes signi cantly upregulated and 169 genes signi cantly downregulated in UCEC (Figures 1(a) and 1(b)).

Biological functions pathways analysis
Biological functions pathways analysis was executed of these 405 differentially expressed ERGs. Gene ontology (GO) enrichment terms are shown in Figure 2. The result of GO enrichment shows that the screened ERGs positively regulate the DNA-binding transcription activation in UCEC. KEGG pathway enrichment of these genes are shown in Figure 2b. These results indicate that the differentially expressed ERGs can promote cancer-related pathways, including Focal adhesion, EGFR tyrosine kinase inhibitor pathways, and may lead to transcriptional deregulation in cancer.

Screened for the ERGs with signi cant prognosis
The differentially expressed EMT-related genes (ERGs) in mRNA expression data of UCEC cohort were identi ed by the "limma" package in R software (FDR < 0.05, |logFC|> 1). After combining clinical data and mRNA expression data of ERGs, univariate Cox regression analyses were used to screen the differentially expressed ERGs with remarkable prognostic value(p<0.05). And 101 genes were screened with signi cant prognostic value in UCEC (Fig 3).

Establishment and validation of the prognostic EMT-related gene signature
In order to identify the EMT associated genes, a machine learning algorithm, LASSO algorithms, was performed and eventually 9 genes (EPHB2, TUFT1, CDKN2A, ONECUT2, RBP2, KLF8, E2F1, SIX1, ERBB2) were identi ed. We utilized 10,000-fold cross validation to select penalty value, lambda(λ). In our study, the optimal computed lambda ranged from 0.0413-0.0809(Fig4a). We set the λ at 0.0809 as it can generate a stricter penalty model which contain much less variates. And at last, a 9-gene LASSO regression model was constructed (Fig4b). The AUC of the constructed LASSO regression model was 0.78 when overall 5-year survival was applied and 0.78 when 2-year survival was applied. These 9 genes were selected for the next step analysis. Multivariate Cox regression analysis performed by "survival" R package was subsequently used to construct a predictive signature. The risk score of the prognostic signature was calculated and EC patients were divided into high-risk group and low-risk group. We veri ed this formula with overall survival data in both TCGA-UCEC cohort and GEO dataset (GSE21882). And we found that this 9-gene signature was signi cantly associated with OS ( Figure 5a). In GSE21882, as only survival status after 5 years could be acquired, we divided the patients into non-survivors and survivors and the RS were compared between these two groups. Results showed that RS was signi cant higher in the non-survivors group (Fig5b). In the TCGA-UCEC dataset, our analysis results showed that there was a signi cant statistical difference in the DFS between the low-risk and high-risk group (Figure5c).
Recently, the endometrial cancer molecular classi cation was introduced by TCGA and promoted by NCCN which initiated a transition toward molecular-based classi cation with signi cant prognostic value and a potential impact on the treatment of EC. There are 4 molecular subgroups: p53-abnormal (p53abn), POLE-ultramutated (POLEmut), mismatch repair-de cient (MMRd), and no speci c molecular pro le (NSMP). In our study, we explored the above molecular pro ling of the high-and low-risk group. In our study, there were higher copy number variations of BRAF, KRAS, PIK3CA, PTEN and TP53 in the high-risk group EC patients by the means of Chi-square test. Also, mutations rate of P53, PTEN and KRAS were higher in the high-risk group EC patients (Fig5d). Coordinated to the TCGA pathology and molecular analysis, high risk group EC patients had higher P53 mutation rate and poorer prognosis, The clinical pathological characteristics were compared between the two groups, including patients' age, BMI, pre-menopause/ menopause status, number of pregnancies, different pathological grades, clinical-pathological staging, clinical outcome after the initial treatment, metastasis/recurrence status. The cut off p-value was set as 0.05. And the result showed that patients in high risk group have a higher percentage in menopause status, pregnancy no less than 2 times, G3 pathological differentiation, advanced clinical-pathological staging, and recurrence or metastasis status (Figure6).

Nomogram construction and validation
After univariate and multivariate survival Cox analysis, 4 signi cant prognostic parameters were enrolled in the construction of nomogram (p<0.1 in multivariate cox regression), detailed result was showed in table1. Respective points of age at diagnosis, differentiation grade, FIGO stage and risk score can be acquired through the point scale axis (Figure 7a). The calculated C-index was 0.796(95%CI: 0.700-0.893). Calibration cures were presented in Figure7b, which showed that the predictive probability was in good accordance with the actual survival proportion.

Immune microenvironment analysis
Based on the immune and stromal score which were acquired through ESTIMATE, we analyzed their relation with the prognosis data in TCGA-UCEC cohort. Patients in TCGA cohort were classi ed into high tumor purity and low tumor purity, signi cant prognostic difference of 5-year overall survival were identi ed, with p=0.012(Fig8a). Wilcoxon rank sum test was used to compared the immune score, stromal score and tumor purity between the high risk and low risk group in the TCGA-UCEC cohort. All p values were lower than 0.05, demonstrating a signi cant difference in immune score, stromal score and tumor purity between the two groups (Fig8b). CTLA4 was known to negatively regulate T cell function through relatively unique and potentially non-overlapping molecular mechanisms [14].Based on the risk score acquired through TCGA-UCEC cohort, CTLA4 was signi cantly up regulated in the high-risk score group(Fig8c). The proportions of 22 tumor-immune cells were displayed in Fig8d, and macrophage M0 and macrophage M2 were the top 2 components in EC tumor immune microenvironment. There were signi cantly difference between the high-risk and low-risk group regrading to the immune cells proportion, and the result were show in Figure8e. And higher proportion of Macrophage M1 and lower Macrophage M2 were found in high risk group, which indicate a potential bene ce from the immune check point blockades therapy, Immune checkpoint blockades therapy and drug sensitivity prediction Nowadays immune check point blockades targeting CTLA-4 and PD-1 has emerged as a promising strategy in various malignant tumor [10]. Clinical response to immune check point blockades were estimated, and results show that T cells dysfunction was lower in the high-risk group, which indicate a more promising response to immune checkpoint blockades targeting CTLA-4 and PD-1(Fig9a), To obtain a better comprehensive analysis of chemotherapy response, drug sensitivity data were acquire from the GDSC database and the IC50 were compared. Pearson correlation between the risk score and IC50 of different chemical components were conducted, and 8 chemical components were identi ed for a negative correlation with the risk score (r<-0.3, p<0.05). Results were displayed in Fig9b. Among the 8 components, A.443654 is a selective Akt inhibitor. ABT.263 and BI 2536 both can induce apoptosis. Lower IC50 of these drugs indicates that high-risk EC patients were more sensitive to the treatment compared to the low-risk EC patients.

Discussion
It has been widely proved that EMT is an important biological process in cancer, including endometrial cancer. Speci cally, EMT is characterized by downregulation of the epithelial cell properties and activation of the mesenchymal characteristic. EMT is thought to be one of the major mechanisms that determine invasion and metastasis of cancer cells.
In this study, we collected the high-throughput data about the transcriptional expression of 543 EC samples and 35 nontumor samples with their corresponding clinical data from TCGA database. At the same time,1184 EMT-related genes (ERGs) were acquired from Human EMT Database. Among these, we screened out the differentially expressed ERGs between EC samples and nontumor endometrium samples. Then, the differentially expressed ERGs are listed and analyzed, Biological functions pathways analysis indicated that differentially expressed ERGs can promote cancer-related pathways, including Focal adhesion, EGFR tyrosine kinase inhabitor pathways, and may lead to transcriptional misregulation in cancer. The AGE-RAGE axis is involved in the onset and progression of various chronic disorders and metabolic syndrome. Several recent researches revealed an association of the AGE-RAGE signaling with different pathological conditions involved in the progression of cancer including cell proliferation, invasion, metastasis, and angiogenesis [15]. AGEs biomarkers pentosidine malondialdehyde (MDA), a lipoxidation product closely linked to AGEs, is suggested to be found in high amount in the breast cancer and lung cancer patients [16].Our ndings also imply that AGE − RAGE signaling pathway may play a role in tumorigenesis of EC for the rst time.
A novel machine learning method LASSO was used to nally identify 9 EMT-related genes, including EPHB2, TUFT1, CDKN2A, ONECUT2, RBP2, KLF8, E2F1, SIX1, ERBB2 as independent prognostic predictor, and a formula about risk score (RS) was raised through multivariate Cox regression analysis. Some of these genes, such as CDKN2A, E2F1 and ERBB2, have been characterized in endometrial cancer tumorigenesis. CDKN2A is one of the crucial defenses against cancer development in number of human cancers. CDKN2A was found to be inactivated in EC and its hypermethylation was correlated with an increased risk of EC [17,18]. The HER2 (ERBB2) gene is ampli ed in 17%-33% of carcinosarcoma, uterine serous carcinoma, and a subset of high-grade endometrioid endometrial tumors [19,20]. However, for the remaining genes, their functions have not been revealed in EC. TUFT1 was initially found to play an important role in the development and mineralization of tooth enamel, and later discovered in many cancerous tissues including pancreatic cancer, hepatocellular carcinoma. Its expression was correlates with unfavorable clinicopathologic characteristics and poor survival [21]. Recent studies have shown that TUFT1 promotes proliferation, metastasis, and epithelial mesenchymal transformation of cancer cells through the Ca2+/ PI3K/AKT pathway [22]. ONECUT2 is a member of the one cut family of transcription factors, known to be an important regulator of early retinal cell fate during development. ONECUT2 may play a critical role in tumorigenesis pan-cancer wide [22]. RBP2 histone demethylase suppresses NOTCH signaling to sustain neuroendocrine differentiation and promote small cell lung cancer tumorigenesis [23]. Krüppel-like factor 8 (KLF8) is a critical inducer of EMT and invasion. KLF8 induces EMT primarily by repressing E-cadherin transcription. KLF8 promotes human breast cancer cell invasion and metastasis by transcriptional activation of MMP9 [24]. Sine oculis-related homeobox 1 homolog (SIX1) is a transcription factor that regulates the development of many tissues and becomes reactivated or overexpressed in multiple types of human cancer. Recent research showed that SIX1 was essential for normal endometrial epithelial differentiation, and delayed DES-induced endometrial carcinogenesis by promoting basal differentiation of CK14þ/18þ cells [24]. Another study also indicated that the SIX1 oncoprotein correlated with uterine cancer. SIX1 was not present in normal endometrium but was expressed in a subset of endometrial cancers in patients who were also more likely to have late-stage diseases [25]. Further studies of the function of these genes may identify new drivers and therapeutic targets for endometrial cancer.
A prognostic EMT-related gene signature for EC specimens was constructed and EC patients were divided into high-risk and low-risk groups according to the risk score in each sample. The results showed that the risk score could successfully predict the survival of EC patients, in which the high-risk group had a signi cantly worse OS than the low-risk group. The ROC curves also indicated that the EMT-related gene signature had favorable prognostic power. All of these ndings were validated in GEO data sets. In addition, we comprehensively analyzed the relationship of the EMT related genes signature with different clinicopathological features. We found that our prognostic signature risk score was signi cantly correlated with clinicopathological data, including age, tumor grade and tumor stage. We also constructed the nomogram by the age, grade, stage and the EMT signature that showed a good performance in predicting the survival of EC patients.
Tumors could acquire the possibility of aggressiveness and metastases through EMT process [26]. Immune cells also participate in EMT via various pathways and in turn, cancer cell crosstalk with immune cells to release immunosuppressive substance to create an immunosuppressive microenvironment that promotes invasion and metastasis [27,28]. Then we explored the immune microenvironments of the highand low-risk group, and high-risk group get a lower stromal score, immune score and tumor purity. Immune in ltration analysis showed that Macrophage M2 are lower and Macrophage M1 are higher in cell fractions in the high-risk group, which indicated a proper microenvironment that enhanced T cells cytotoxic function.
There were few studies reporting immunotherapy and EMT targeted therapy in EC patients, and certain EC patients can bene t from Pembrolizumab [29].We further evaluated the likelihood of response to immunotherapy in the high and low risk group. High-risk score group expressed higher level of CTLA4, which may indicate that they are more promising targets for therapeutic immunosuppressive treatment. T cells dysfunction score by TIDE algorithm also indicated a promising response to immune checkpoint blockades therapy targeting CTLA-4/PD-1. As second line chemotherapy was still debatable in advanced EC patients, we screened out 8 components which may be crucial in the substitute chemotherapy in the GDSC database. A.443654 is a selective Akt inhibitor. PI3K/Akt/mTOR pathway is crucial in various tumor, including endometrium carcinoma. PTEN can regulate the tumor cells proliferation via mTOR pathway, and 30-60% EC exhibit a lack of PTEN. Reports showed that mTOR inhibitors can improve the disease-free survival in the advanced EC patients [30]. ABT.263 disrupts Bcl-2/Bcl-xl interactions with prodeath protein, leading to the initiation of apoptosis, BI 2536 is a Plk1 inhibitor, which can induce apoptosis and attenuates autophagy. and BI 2536 both can induce apoptosis. Lower IC50 of these drugs indicate that high-risk EC patients may be more sensitive to these chemo treatments.

Conclusions
In conclusion, we identi ed and validated a 9 EMT-related genes signature with certain prognostic value for EC patients. Immune microenvironment characters of high and low risk group were revealed, which may contribute to a better understanding for the molecular mechanism of EC and provide references for decision making during EC patients treatment. The datasets analyzed during the current study are available in The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO).Epithelial-Mesenchymal Transition gene list was acquired from Epithelial-Mesenchymal Transition GENE database(http://dbemt.bioinfo-minzhao.org/). Tumor Immune Estimation Resource database(TIMER) was implied for associated immune analysis. Immune in ltration was acquired from CIBERSOT databes(https://cibersort.stanford.edu/). Tumor Immune Dysfunction and Exclusion(TIDE) and Genomics of Drug Sensitivity in Cancer(GDSC) were applied for immune checkpoint blockades therapy and drug sensitivity prediction separately.        Immune microenvironment analysis. A )Over all survival were analysized between high and low tumor purity group, and p=0.012. b)High risk score group were associate with lower Stromal score, Immune score and tumor purity. c) Higher CTLA-4 in the high risk group. * equals p<0.05; ** equals p<0.01. d) immune cell in ltration landscape in TCGA-UCEC patients. e) difference in immune cell in ltration abundance between high-and low-risk group.

Figure 9
Immunotherapy and drug sensitivity prediction. a) T cell dysfuncion and T cell exclusion in the high-and low-risk patients with TCGA-UCEC; b) Different chemotheraputic response prediction in high-and low-risk patients.