5mC-Related lncRNAs as Potential Prognostic Biomarkers in Colon Adenocarcinoma

Simple Summary To identify the prognostic significance of 5mC-related lncRNAs in colon adenocarcinoma (COAD), we examined the expression levels and mutations of 21 5mC-regulated genes of COAD in TCGA. We also identified lncRNAs associated with 5mC regulatory genes using Pearson correlation analysis. After the least absolute shrinkage and selection operator (Lasso) Cox regression, the risk signature of 4 5mC-related lncRNAs was selected. Next, the risk signature’s predictive efficacy was proven. Moreover, the biological mechanism and potential immunotherapeutic response of this risk signature were identified. Collectively, we constructed the 5mC-related lncRNA risk signature, which could provide a novel prognostic prediction of COAD patients. Abstract Globally, colon adenocarcinoma (COAD) is one of the most frequent types of malignant tumors. About 40~50% of patients with advanced colon adenocarcinoma die from recurrence and metastasis. Long non-coding RNAs (lncRNAs) and 5-methylcytosine (5mC) regulatory genes have been demonstrated to involve in the progression and prognosis of COAD. The goal of this study was to explore the biological characteristics and potential predictive value of 5mC-related lncRNA signature in COAD. In this research, The Cancer Genome Atlas (TCGA) was utilized to obtain the expression of genes and somatic mutations in COAD, and Pearson correlation analysis was used to select lncRNAs involved in 5mC-regulated genes. Furthermore, we applied univariate Cox regression and Lasso Cox regression to construct 5mC-related lncRNA signature. Then Kaplan–Meier survival analysis, principal components analysis (PCA), receiver operating characteristic (ROC) curve, and a nomogram were performed to estimate the prognostic effect of the risk signature. GSEA was utilized to predict downstream access of the risk signature. Finally, the immune characteristics and immunotherapeutic signatures targeting this risk signature were analyzed. In the results, we obtained 1652 5mC-related lncRNAs by Pearson correlation analysis in the TCGA database. Next, we selected a risk signature that comprised 4 5mC-related lncRNAs by univariate and Lasso Cox regression. The prognostic value of the risk signature was proven. Finally, the biological mechanism and potential immunotherapeutic response of the risk signature were identified. Collectively, we constructed the 5mC-related lncRNA risk signature, which could provide a novel prognostic prediction of COAD patients.


Introduction
Colon adenocarcinoma (COAD) is the most commonly diagnosed colorectal cancer [1]. The main treatments for COAD include surgery, chemotherapy, radiation therapy, immunotherapy, etc. With the continuous development of treatment, the overall prognostic

Data Obtaining and Correlation Analysis
RNA transcriptome data and associated clinical features on the COAD patients were downloaded from the TCGA database (https://portal.gdc.cancer.gov, accessed on 20 September 2021). Patients without survival information were deleted. The expression levels of all transcripts were measured using fragments per kilobase million (FPKM) and were standardized using a log2-based transformation. In addition, based on human genome annotation datasets, genes were classified as protein-coding genes or lncRNA genes. The somatic mutation data were downloaded from the TCGA database and the copy number variants of TCGA-COAD were collected from the UCSC Xena database. As shown in the previous study [31], the 21 5mC regulators included 3 writers (DNMT1, DNMT3A, DNMT3B), 15 readers (MBD1, MBD2, MBD3, MBD4, MECP2, NEIL1, NTHL1, SMUG1, TDG, UHRF1, UHRF2, UNG, ZBTB33, ZBTB38, ZBTB4), and 3 erasers (TET1, TET2, TET3). Then, we determined the expression of the 21 5mC regulatory genes. The CNV landscape of 21 5mC regulatory genes was visualized by the R package "Rcircos." Meanwhile, the somatic mutation data were performed by the "maftools" package. We used Pearson correlation analysis to verify the association between 5mC regulatory genes and noncoding RNAs. A total of 1652 5mC-related lncRNAs were selected based on a correlation coefficient more than 0.4 and a p-value less than 0.001.

Identification and Construction of Risk Signature
Patients were randomly assigned to a training group and a testing group. The training dataset was performed to generate a signature for 5mC-related lncRNAs, the testing and complete datasets were employed to verify the created model. First, univariate Cox regression analysis was used to identify the prognostic efficacy of lncRNAs associated with 5mC. After that, we conducted Lasso penalized Cox regression analysis to discover the prognostic model in the training dataset, resulting in the establishment of a four 5mCrelated lncRNA risk signature. Next, the 5mC-related lncRNA prognostic signature in the training dataset was identified using Lasso Cox regression analysis, and a four 5mC-related lncRNA risk signature was also established. The following algorithm was performed to calculate the score for each COAD patient: Risk score = coef(lncRNA1) × expr(lncRNA1) + coef(lncRNA2) × expr(lncRNA2) + · · · + coef(lncRNAn) × expr(lncRNAn) where coef represents coefficients, coef(lncRNAn) represents the coefficient of lncRNAs associated with survival, and expr(lncRNAn) denotes lncRNAs expression. Finally, we classified the COAD patients into low-and high-risk subgroups depending on the median risk score.

Evaluation and Verification of the 5mC-Related lncRNA Prognosis Signature
The COAD patients were divided into high-risk and low-risk groups based on their prognostic risk score using the median risk score in the 5mC-related lncRNA prognosis signature. By utilizing R tools "survival", "survminer", and "timeROC", we measured the signature's accuracy. The Kaplan-Meier survival curve was performed to estimate the variation in surviving rate of high-risk and low-risk subgroups. The "Rtsne" and "ggplot2" tools were applied to achieve efficient dimensionality reduction, pattern recognition, and cluster depiction on the high-dimensional statistics of the four 5mC-related lncRNAs genes. Using the "survival" tool, we conducted univariate and Lasso Cox regression analysis to detect the individual predictive indicators. The R tool "rms" was employed to estimate the prediction accuracy of COAD individuals constructed by a nomogram.

Function Enrichment Analysis and Immunization Score Analysis
We utilized GO functional annotations and KEGG-enriched analysis to comprehend the 5mC regulating genes' underlying biologic function. Gene set enrichment analysis (GSEA) is a methodology used to compare biological processes across subgroups. Besides, we measured the immune cell expression in two subgroups based on the constructed signature using CIBERSORT.

Statistical Analyses
The analysis data of this experiment were all produced by R 3.6.0. The two-tailed t-test was performed to analyze continuous variables between the two subgroups, and the Chi-square test was conducted to evaluate categorical data. The Kaplan-Meier method was used to create the survival curves for the prognostic analysis, and log-rank tests were applied to assess the significance of differences. Associations between risk characteristics and patients' clinical outcomes were analyzed by Kaplan-Meier survival curves and Cox proportional hazards models. A p-value less than 0.05 on both sides was judged significant.

The Landscape of Genetic Variation of 5mC Regulatory Genes in COAD Patients
The detailed workflow of the construction of risk signature and subsequent analysis is shown in Figure 1. In this research, we explored the effects of 21 5mC DNA methylation regulatory genes in COAD patients. As shown in Figure 2A, we evaluated the incidence of somatic mutations in 21 5mC regulatory genes in COAD. We found that 108 of 394 (27.07%) samples carried genetic alterations in 5mC regulators, including amplifications, mutations, and deletions. The most frequent mutation was TET3, followed by TET1, DNMT1, and TET2. Additionally, the analysis of 21 5mC regulators revealed that CNV mutations were prevalent, DNMT3B, DNMT1, MECP2, and DNMT3A were amplifications of CNV. As opposed to MBD3, MBD2, UHRF1, UHRF2, and MDB1, which had prevalent deletions within the CNV ( Figure 2B). Meanwhile, the positions of 21 5mC regulators CNV alterations on chromosomes were shown in Figure 2C. On the 5mC regulator network, the interactions between the 21 5mC regulators and their prognostic significance in COAD patients were illustrated comprehensively ( Figure 2D). Further investigation illustrated that DNMT1/3A/3B, TDG, SMUG1, UNG, NEIL1, MDB3/4, ZBTB33, and UHRF1/2 were essentially upregulated in tumor samples, while TET2, MBD1, and MBD2 were downregulated in tumor samples ( Figure 2E). We conducted GO functional annotations and KEGG enrichment analysis of 21 5mC regulatory genes, revealing that the biological processes were significantly enriched ( Figure A1A,B). Then, the association between these 5mC regulatory genes and the survival of COAD sufferers was revealed utilizing Cox regression analysis. The forest plot revealed that DNMT3A and ZBTB4 could be regarded as protecting factors and were remarkably correlated with longer overall survival ( Figure A1C). Kaplan-Meier survival analysis showed that 5mC regulatory genes are associated with overall survival in COAD patients ( Figure A2A-K). These results demonstrate that 5mC regulatory genes have an essential impact on tumor progression.
were essentially upregulated in tumor samples, while TET2, MBD1, and MBD2 were downregulated in tumor samples ( Figure 2E). We conducted GO functional annotations and KEGG enrichment analysis of 21 5mC regulatory genes, revealing that the biological processes were significantly enriched ( Figure A1A,B). Then, the association between these 5mC regulatory genes and the survival of COAD sufferers was revealed utilizing Cox regression analysis. The forest plot revealed that DNMT3A and ZBTB4 could be regarded as protecting factors and were remarkably correlated with longer overall survival ( Figure  A1C). Kaplan-Meier survival analysis showed that 5mC regulatory genes are associated with overall survival in COAD patients ( Figure A2A-K). These results demonstrate that 5mC regulatory genes have an essential impact on tumor progression.

Identification of 5mC-related lncRNAs in COAD Patients
First, owing to the 5mC regulators being critical in the progression of cancer [33][34][35], we recognized 21 5mC regulatory genes by consulting the literature [31,36]. The mRNA expression of 21 5mC regulatory genes and 13413 lncRNAs were downloaded from the TCGA database. We applied the Pearson correlation analysis to characterize potentially

Identification of 5mC-Related lncRNAs in COAD Patients
First, owing to the 5mC regulators being critical in the progression of cancer [33][34][35], we recognized 21 5mC regulatory genes by consulting the literature [31,36]. The mRNA expression of 21 5mC regulatory genes and 13413 lncRNAs were downloaded from the TCGA database. We applied the Pearson correlation analysis to characterize potentially 5mC-related lncRNAs. As a result, we observed 611 lncRNAs associated with 5mC regulators with correlation coefficients > 0.4 and a p-value < 0.001. Next, we screened 5mC-related prognostic lncRNAs from the 611 5mC-related lncRNAs in the complete dataset utilizing univariate Cox regression analysis. As shown in Figure 3B, 5mC-related lncRNAs were found to be strongly correlated with overall survival in the complete dataset. Then, we depicted the levels of expression of these lncRNAs in normal tissues and tumors by heatmap. As shown in Figure 3A, 2 5mC-related lncRNAs were upregulated in normal tissues, and 6 5mC-related lncRNAs were upregulated in COAD tissues.

Construction of a Signature for 5mC-Related lncRNAs in COAD Patients
A total of 367 COAD patients were randomly assigned to one of two groups: training dataset (n = 184) or testing dataset (n = 183). Next, Lasso-penalized Cox analysis was used on the 8 prognostic lncRNAs to generate a 5mC-related lncRNA model which contains 4 lncRNAs ( Figure 3C,D). Each sample's risk score was obtained utilizing coefficients of the 4 lncRNAs. (Risk scores = AC008760.1 × 0.269428721981373 + AC138207.5 × 0.281459243411876 + AC156455.1 × 0.178004647291392 + ZEB1 − AS1 × 0.430050179556838). Based on the median prognostic risk value, COAD patients were classified into low-risk and high-risk groups. To investigate the expression of these four lncRNAs in normal colon cells and colon cancer cells, we performed RT-qPCR experiments on normal human intestinal epithelial cells and colorectal cancer cells. The results showed that all four lncRNAs were highly expressed in colorectal cancer cells, except for AC138207.5, which was lowly expressed in HCT116 cells ( Figure 3E). Altogether, these results reveal that the four lncRNAs in our signature are probably risk genes in colorectal cancer.

Validation of the 5mC-Related lncRNA Prognosis Signature
COAD patients were categorized as low-risk or high risk according to their risk score in the training dataset, testing dataset, or complete dataset. The Kaplan-Meier survival curve analysis revealed that the overall survial of COAD patients with low-risk scores was considerably longer than those of patients with high-risk scores ( Figure 4A-C). The receiver operating characteristic (ROC) curves also indicated that prediction efficiency of the 4 5mCrelated lncRNA prognosis signature was pretty good and was robust in predicting COAD prognosis ( Figure 4D-F). The distribution of risk scores, survival status, and heatmap for the four 5mC-related lncRNAs were then evaluated. Figure 5A,C,E depicted the relative expression levels of the four 5mC-related lncRNAs in each patient. The distribution of risk classes, survival status, and survival time for each patient in each risk category is depicted in Figure 5B,D,F. Principal component analysis (PCA) revealed two markedly different distribution patterns between high-risk and low-risk groups ( Figure A3A-F).

Prognostic Value of 5mC-Related lncRNA Prognosis Signature
To evaluate the relationship between the lncRNA prognostic signature and clinical characteristics by performing Kaplan-Meier survival curve analysis, we first examined the predictive significance of the 5mC-related lncRNA signature in COAD patients. COAD patients' independent prognostic characteristics include their age, gender, and stage. Among the subgroups, those classified as high-risk had significantly poorer survivability than those classified as low-risk ( Figure 6A-H). These findings revealed that the 5mC-related lncRNA signature could be used to forecast the outcome of COAD regardless of clinical characteristics. Moreover, as illustrated in Figure A4A-D, we discovered that older individuals had a significantly higher risk score than younger individuals (p < 0.05). In addition, the risk score increased with the stages (p < 0.05). Next, we further examined the correlation between risk signature and clinicopathological characteristics. In Figure A5A-D, we found that the proportion of patients aged <65 years, N0-1, and I-II stages was almost equally distributed between the two subgroups. Nonetheless, patients aged ≥65 years, N1, and in III-VI stages were overrepresented in the high-risk subgroup compared to the low-risk subgroup. (p < 0.05, Chi-square test).  older individuals had a significantly higher risk score than younger individuals (p < 0.05). In addition, the risk score increased with the stages (p < 0.05). Next, we further examined the correlation between risk signature and clinicopathological characteristics. In Figure  A5A-D, we found that the proportion of patients aged < 65 years, N0-1, and I-II stages was almost equally distributed between the two subgroups. Nonetheless, patients aged ≥ 65 years, N1, and in III-VI stages were overrepresented in the high-risk subgroup compared to the low-risk subgroup. (p < 0.05, Chi-square test).

Independent Predictive Value of the 5mC-related lncRNA Prognosis Signature
To investigate the independence of the 5mC-related lncRNA prognostic signature within clinical characteristics, we performed univariate and multivariate Cox regression analyses on the training dataset. We observed that the 5mC-related lncRNA signature could be used as a stand-alone predictive biomarker for COAD ( Figure 7A,B). To clarify the definition of a personalized overall survival prediction signature, we constructed a nomogram of COAD patients based on independent prognostic features ( Figure 7C). The calibration plots indicated that the signature's performance was consistent with 1-year, 3year, and 5-year overall survival predictions ( Figure 7D-F).

Independent Predictive Value of the 5mC-Related lncRNA Prognosis Signature
To investigate the independence of the 5mC-related lncRNA prognostic signature within clinical characteristics, we performed univariate and multivariate Cox regression analyses on the training dataset. We observed that the 5mC-related lncRNA signature could be used as a stand-alone predictive biomarker for COAD ( Figure 7A,B). To clarify the definition of a personalized overall survival prediction signature, we constructed a nomogram of COAD patients based on independent prognostic features ( Figure 7C). The calibration plots indicated that the signature's performance was consistent with 1-year, 3-year, and 5-year overall survival predictions ( Figure 7D-F).

Molecular Characteristics of the 5mC-Related lncRNA Prognosis Signature
To get a better understanding of the biological functions and signaling pathways between the different subgroups, we conducted GSEA to identify genes that were enriched in different risk groups. As illustrated in Figure 8A,B, the high-risk group's gene set was enriched for pathways associated with chromatin organization and systemic lupus erythematosus. Then, we further explored the differences in CNV between the risk groups, as shown in Figure 8C,D. The top 20 genes with the greatest genetic variation in risk groups were discovered, and the mutation rates of APC, TP53, TTN, and KARS were higher than 35% in both groups. Mutations in the SYNE1 gene were more common in patients classified as high-risk. While mutations in the PLCO gene were more common in patients classified as low-risk.

Molecular Characteristics of the 5mC-related lncRNA Prognosis Signature
To get a better understanding of the biological functions and signaling pathways between the different subgroups, we conducted GSEA to identify genes that were enriched

Estimation of the Tumor Immune Cell Types Utilizing the lncRNA Signature
The 22 different immune cells among different risk groups were analyzed utilizing the CIBERSPRT algorithm. As a result, immune cell concentrations showed significant differences among the subgroups ( Figure 9A). The risk score was subsequently shown to be positively connected with plasma and T follicular helper cells, whereas macrophages, M0 cells, and active mast cells were negatively correlated ( Figure 9B). Next, survival analysis also showed dense infiltration of macrophages M0 cells and activated mast cells correlated with a better prognosis, whereas, high levels of plasma cells were associated with worse survival (Figure A6A-C). We further examined the relationships between risk factors and immunity function. According to the findings, the high-risk subgroup contained more B cells, T cells, and macrophages ( Figure 9C). Given the fact that PD-L1 is a well-recognized biomarker for forecasting sensitivity to anti-PD-1/L1 therapy [37][38][39], we examined the levels of PD-L1 expression in various risk groups ( Figure 9D,E). These findings indicated that PD-L1 expression was strongly related to risk score and was increased in the high-risk subgroup, which implied that the risk score could act as an indicator of immunotherapy.

Discussion
Colorectal cancer, one of the most prevalent tumors, is increasingly being investigated by scholars in terms of its development and treatment [40][41][42]. Several molecular markers have been reported as predictors of prognosis and therapeutic efficacy in colon

Discussion
Colorectal cancer, one of the most prevalent tumors, is increasingly being investigated by scholars in terms of its development and treatment [40][41][42]. Several molecular markers have been reported as predictors of prognosis and therapeutic efficacy in colon cancer [43][44][45]. Recently, prognostic models of lncRNAs have come into the focus of researchers as predictors of prognosis and immunotherapeutic efficacy in a variety of tumors [46][47][48]. Epigenetic modifications also have gained increasing attention, such as the 5mC modifier genes, as a type of epigenetic modification, are regarded as a diagnostic and predictive indicator of malignancy [31,36]. Nevertheless, there are few reports on the significance of 5mC-related lncRNAs in the prognostic and diagnostics of COAD which deserves more investigation.
DNA 5mC methylation is a reversible post-transcriptional modification governed by 5mC-related regulators [49]. Recent research has found that 5mC genes regulate tumor growth and 5mC-regulated lncRNAs can be used as prognostic indicators of tumors [31,50,51]. In this research, we constructed a lncRNA signature containing 4 5mCrelated lncRNAs to predict overall survial in COAD patients. A previous study reported that lncRNA ZEB1-AS1 drives malignant progression of COAD through the miR-455-3p/PAK2 pathway [52]. In addition, other lncRNAs were found for the first time in COAD which need to be further investigated. Following that, the Kaplan-Meier survival curve and the ROC analysis revealed that the lncRNA signature has a strong prognostic value and was highly accurate in forecasting the overall survival of COAD. We also found a significantly poorer clinical phenotype in the high-risk subgroup versus the low-risk subgroup. Additionally, we found this signature's value at risk was an independent predictor of overall survival based on multivariate Cox regression analysis. A nomogram also showed well-consistency between observed and predicted overall survival rates at 1-, 3-, and 5-years. Our findings indicated that the lncRNA signature could serve as a potential efficient biomarker for COAD.
To assess the important biological functional phenotypes between the high-risk and low-risk groups based on the 5mC-related lncRNA prognosis signature, we conducted a GESA enrichment analysis. Nuclear lncRNAs have been known to play an important role in chromatin organization, transcription, and post-transcriptional gene expression in tumor cells [53]. Notably, the high-risk group was mainly enriched for chromatin organization, chromatin silencing, DNA packaging complex, negative regulation of gene expression, and nucleosome. Next, we studied gene mutations of different risk groups. Missense variants were the most prevalent, followed by nonsense and frameshift deletions. As previously reported, APC was the most frequently mutated gene in COAD [54], followed by TP53, TTN, and KARS. The highest mutational difference between the two groups was in the TTN mutation, which was more frequently detected in the high-risk subgroup than those from the low-risk subgroup (50% vs. 42%). The TTN gene is frequently mutated in a variety of tumor types, and frequent detection of TTN in solid tumors is related to a higher tumor mutation burden and a more favorable objective response to immune checkpoint blockade immunotherapy [55]. These findings showed that patients in the highrisk subgroup may have a better prognosis of response to immune checkpoint blockade treatment. Additionally, comparing to the high-risk subgroup, the low-risk subgroup had a higher mutation rate (20% vs. 13%) for PCLO which is frequently mutated in tumors, including hepatocellular carcinoma and diffuse large B-cell lymphoma [56,57]. Meanwhile, a mounting body of evidence suggests that mutations in PCLO are associated with tumor sensitivity to etoposide [58], implying that the sensitivity to chemotherapy may be different in various risk subgroups. TME has been considered as a prospective biomarker for predicting response to immunotherapeutics recently [59,60]. Understanding the TME landscape of tumors may help us to find new immunotherapeutic treatments. Our results suggested that the composition of some immune cells differs within subgroups. Plasma cells and T cells follicular helper cells were significantly plentiful in the high-risk subgroup. In contrast, macrophages M0 were more common in the low-risk subgroup. Survival curves also depicted the dense infiltration of macrophages M0 cells had a better prognosis. On the contrary, dense infiltration of plasma cells had a worse prognosis. Moreover, an increasing number of studies suggested that immunotherapy with PD-L1 and PD-1 blockade is undoubtedly a breakthrough in cancer treatment [61,62]. Our results also indicated that individuals classified as high-risk have a high level of PD-L1 expression, suggesting a potential sensitivity to anti-PD-1/L1 treatment. Taken together, we conclude that this signature has strong ability to provide a good immune biomarker for COAD.
The pathological staging is still utilized to predict the prognosis in COAD patients. However, individuals with a similar stage of the disease have significantly diverse prognoses, which indicates the limitations of the current staging systems. Therefore, there is an urgent need to find better predictive and therapeutic markers to stratify patients for treatment. The 5mC-related lncRNA signature we constructed performed well in terms of COAD patient survival prediction, which could provide a new method for prognosis prediction in COAD patients. However, we also note the shortcomings and limitations of this research. Owing to the limited availability of COAD sample data (downloaded from the TCGA database: https://portal.gdc.cancer.gov, accessed on 20 September 2021), more validation to the prognostic value of the 5mC-related lncRNA signature by external clinical datasets will be needed in the next research.

Conclusions
In conclusion, we identified four 5mC-related lncRNAs that could be employed as an independent predictive indicator for COAD. This risk signature may help to differentiate between immunological and molecular features as a potential prognostic indicator for immunotherapy.