Identification and Clinical Validation of a Novel 4 Gene-Signature with Prognostic Utility in Colorectal Cancer

Colorectal cancer (CRC) is a high-burden disease with several genes involved in tumor progression. The aim of the present study was to identify, generate and clinically validate a novel gene signature that improves prediction of overall survival (OS) and thereby supports more effective management of colorectal cancer. We explored The Cancer Genome Atlas (TCGA) COAD and READ datasets (597 samples) through The Protein Atlas (TPA) database to extract a total of 595 candidate genes. In parallel, we identified 29 genes with perturbations in > 6 cancers that are also affected in CRC. These genes were entered in cBioportal to generate a 17-gene panel with the highest perturbations. For clinical validation, this gene panel was tested on FFPE tissues of colorectal cancer patients (88 patients) using Nanostring analysis. In multivariate analysis, a high prognostic score (composite 4-gene signature: DPP7/2, YWHAB, MCM4 and FBXO46) was a significant predictor of poor prognosis in CRC patients (HR: 3.42, 95% CI: 1.71–7.94, p < 0.001 *) along with stage (HR: 4.56, 95% CI: 1.35–19.15, p = 0.01 *). The Kaplan-Meier analysis also segregated patients on the basis of prognostic score (log-rank test, p = 0.001 *). External validation using a GEO dataset (GSE38832, 122 patients) corroborated the prognostic score (HR: 2.7, 95% CI: 1.99–3.73, p < 0.001 *). Additionally, a higher score was able to differentiate stage II and III patients (130 patients) on the basis of OS (HR: 2.5, 95% CI: 1.78–3.63, p < 0.001 *). Overall, our results identify a novel 4-gene prognostic signature that has clinical utility in colorectal cancer.


Introduction
Colorectal cancer (CRC) affects nearly 1.4 million individuals every year, accounting for up to 10% of the global burden of cancer [1]. According to the 2019 cancer statistics report, colorectal cancer caused the third highest number of cancer deaths in the United States [2]. Progress in early detection and in surgical and chemotherapeutic interventions has significantly reduced the mortality rate; however, the high relapse rate and variable survival among patients highlight the need for better prognostic biomarkers [3]. Several recent studies have identified gene expression signatures in cancer that have prognostic utility [4][5][6]. The OncotypeDX [7], GeneFx Colon [8] and Coloprint [9] signatures are available and are currently being evaluated in multiple independent cohorts [10]. There is a need for new signatures, as all the existing prognostic signatures have been shown to offer only marginal clinical utility compared to conventional risk factors [11]. Further, a robust risk-gene signature is required to assist clinicians in tailoring personalized treatment for the diversity of CRC patients. Over the past few years, several consortium efforts have yielded massive data on multiple types of cancer. The TCGA Research Network is one such project, with 2.5 petabytes of data that catalog DNA sequences and their modifications along with transcriptome data of more than 11,000 individuals in over 30 types of cancer [12]. Building on TCGA datasets, secondary databases like TPA and cBioportal can provide hundreds of potential prognostic genes. These genes need additional validation through independent studies, and our study is one such effort.
The Protein Atlas (TPA) database has analyzed transcriptome variation with respect to clinical outcome in 17 major cancers [13]. Another platform, cBioportal, is a graphical web interface for exploring aberrations at the genetic, epigenetic and expression level in multiple types of cancer [14]. The top hits from the TPA database and cBioportal were combined to build a prognostic gene panel. The resulting 17-gene panel was internally tested on formalin-fixed, paraffin-embedded (FFPE) tissues of CRC patients. In the past, FFPE tissues with clinical information have been instrumental in facilitating prognostic biomarker discovery [15,16]. Additionally, RNA molecules identified in FFPE tumor tissues have been shown to be of the same high quality as those in fresh frozen tissue [17]. Clinically, overall survival analysis based on mRNA expression has also shown consistent results between fresh frozen and FFPE tissues [18,19]. To explore differential expression between normal and tumor tissues, the GEPIA (Gene Expression Profiling Interactive Analysis) database was accessed. GEPIA collates normal-tissue gene expression from TCGA and the Genotype-Tissue Expression (GTEx) project [20,21]. The aim of this study was to identify clinically actionable candidate genes from both the TPA database and cBioportal, and then to validate those genes internally and externally using FFPE tissues from CRC patients and independent GEO (Gene Expression Omnibus) datasets.

Exploratory Analysis to Build 17-Gene Panel
To identify risk genes in CRC, 595 candidate genes were obtained from the TCGA database via The Human Protein Atlas. Analysis of mRNA expression at a z-score threshold of ± 2.0 revealed a significant association (p < 0.05) of the combined gene signature with overall survival using KM analysis in cBioportal. Among 222 CRC patients, a total of 7 genes in combination showed significant alterations, with PI4K2B exhibiting the most differential expression, in 10% of patients (Table S1). In parallel, 29 genes with prognostic significance in 6 or more types of cancer were also run in cBioportal. The most altered expression was observed for 10 genes in the COAD dataset, with YWHAB exhibiting significant changes in 39.10% of CRC patients (Table S2). These 10 genes showed significant prognostic value in > 6 cancers (Table S3). The genes included in the panel are shown in Table 1 along with the comparison between tumor and normal colon gene expression.

External Validation of Prognostic Score with GEO Microarray Dataset and ROC analysis
To investigate the predictive potential of our four-gene model, an independent GEO microarray dataset (GSE38832) was acquired. The univariate and multivariate Cox regression analysis of this dataset is presented in Table S6. The KM analysis of the individual genes is presented in Figure 3. In the external dataset, the prognostic significance of the individual genes was: YWHAB (HR 1.71, 95% CI: 1.12-2.61, p = 0.012), DPP7/2 (HR 0.45, 95% CI: 0.29-0.69, p = 0.0003), MCM4 (HR 3.37, 95% CI: 2.19-5.23, p < 0.001) and FBXO46 (HR 2.02, 95% CI: 1.10-3.69, p = 0.49). The composite prognostic score of all four genes remained highly significant in separating the lower- and higher-survival groups, with median survival of 31 vs. 69 months, respectively (HR 2.7, 95% CI: 1.99-3.73, p < 0.001 *) (Figure 4). Additionally, ROC analysis was performed on the gene signature. In the external dataset, the AUC values for survival at less than 1 year, less than 3 years and more than 3 years were 0.529, 0.705 and 0.722, respectively. In the internal dataset, the AUC values for survival at >1 year, <3 years and >3 years were 0.590, 0.534 and 0.607, respectively (Figure S1).
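As a brief illustration of what an AUC value summarizes, the sketch below computes the area under the ROC curve as the probability that a randomly chosen patient with the event has a higher score than a randomly chosen patient without it. This is a generic, assumed formulation: the study does not state how its survival-time AUCs were computed, and the function name `roc_auc`, the scores and the labels are invented for illustration.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive case outranks a negative one.

    scores: prognostic scores; labels: 1 = event (e.g., death by a chosen
    time point), 0 = no event. Ties between a positive and a negative
    case count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: five patients, two of whom had the event.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)  # 5 winning pairs out of 6
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation, which is the scale on which the values reported above should be read.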

Validation of Prognostic Score in Combined Stage II and Stage III Patients
The combined analysis of stage II and stage III patients maintained the prognostic validity of the score. A high score was a significant predictor of OS (HR 2.5, 95% CI: 1.78-3.63, p = 0.001 *). The KM analysis revealed that median survival in the high prognostic score group was significantly shorter than in the low prognostic score group, 37.6 vs. 75.9 months, respectively (Figure 5).

Comparison with Normal TCGA Datasets
To further explore the variations observed in our data, the differential gene expression between the normal and colon adenocarcinoma datasets was accessed through the GEPIA portal. YWHAB, LRRC59 and MCM4 were significantly overexpressed in tumor tissue (p < 0.05) (Figure 6). FBXO46 showed slightly higher expression in tumor tissue but did not reach statistical significance. DPP7/2 expression was lower in tumor tissue.

Biological Features of Significant Genes Found in This Panel
The functional roles of the significant genes in this panel are presented in Table 6. YWHAB plays a role in signal transduction and the cell cycle. MCM4 plays an essential role in DNA replication. DPP7/2 is associated with apoptosis. FBXO46 plays a role in cancer biogenesis, and LRRC59 promotes angiogenesis and can fuel tumor growth.

Table 6. Functional relevance of genes that were significantly associated with OS in CRC patients.
Gene | Function and Role in Cancer | References
YWHAB | Signal transduction and cell cycle, genetically altered in multiple cancers | [22,23]

Discussion
CRC is the third deadliest cancer in the United States. It is essential to develop and validate new gene expression-based prognostic markers that can predict clinical outcomes more effectively. The present study was conducted with two goals: first, as a single biomarker is not scalable to a larger population, we set out to generate a robust composite four-gene prognostic score to predict survival status in CRC patients; and second, to further validate some of the massive amount of data that has been generated through TCGA and other databases. Additionally, the inclusion of samples from African-American and Caucasian patients, along with other parameters, provided an opportunity to explore variations in gene expression based on various clinicopathological characteristics. An effort was made to identify new prognostic genes because the African-American population has a higher rate of incidence and mortality due to CRC [29]. This study analyzed in silico RNA-seq data from TCGA and built on it to develop and experimentally validate a prognostic model through Nanostring analysis. In addition to screening CRC prognostic genes from The Protein Atlas, genes with prognostic utility in 6 or more cancers were also included. The rationale of this top-down selection was to check the clinical significance of these genes in CRC patients. As these genes are aberrant in multiple cancers, they might play an important role in CRC tumorigenesis and could yield promising prognostic information. The four-gene signature, YWHAB, MCM4, FBXO46 and DPP7/2 (HR 5.39, 95% CI: 2.19-15.26, p < 0.001 *), was developed after multivariate Cox proportional hazards regression on the mRNA expression data from Nanostring analysis. In univariate Cox regression analysis, only stage showed prognostic correlation with overall survival (HR 2.9, 95% CI: 1.39-6.36, p < 0.001 *). In the multivariate Cox regression model, stage and prognostic score maintained a strong correlation with overall survival.
Interestingly, alcohol consumption and tobacco consumption showed an inverse correlation with overall survival. All the genes in the final prognostic model play a role in cancer growth and progression. Unexpectedly, 3 of the 4 genes are from the gene list with prognostic value in > 6 cancers (YWHAB, MCM4, FBXO46). This hints at a previously unidentified role of these genes in CRC tumorigenesis and prognosis. One of the genes, YWHAB, is included in a metastasis-prone 54-gene signature for colorectal cancer [22]. Genetic alterations in YWHAB have been observed in large-scale integrated genomic analyses of multiple cancers [23]. Further, it has been shown that knockdown of B-cell translocation gene (BTG3) is related to over-expression of multiple genes, including YWHAB, in colorectal cancer [30]. As YWHAB is involved in multiple signaling pathways inside the cell, it might act downstream of genes like BTG3 in CRC carcinogenesis [30]. In another proteomics study, the differential expression of YWHAB in the anti-tumor response to retinoic acids was quantified using comparative MALDI-TOF analysis [31]. Although LRRC59 was not part of the 4-gene prognostic score, it showed higher expression in tumor tissue and was associated with stage and overall survival (Table S4). LRRC59 is involved in chromosomal rearrangement in multiple cancers [28]. LRRC59 binds to fibroblast growth factor 1 (FGF1) and imports it into the nucleus [24]. FGFs are known to promote tumor angiogenesis through their synergistic action with vascular endothelial growth factor (VEGF) [25]. LRRC59 is associated with a significantly poorer prognosis in breast cancer [32]. Additionally, LRRC59 has been shown to transport CIP2A (cancer inhibitor of PP2A) into the nucleus, disrupting mitotic checkpoints and deregulating the cell cycle in prostate cancer cells [33].
The minichromosome maintenance (MCM) proteins play an essential role in DNA replication [34]. The dysregulation of MCM proteins has been linked with cancer and has been a promising prognostic marker, especially in esophageal adenocarcinoma and pancreatic lesions [26]. DPP7/2 encodes aminopeptidases that are expressed in both quiescent lymphocytes and fibroblasts, maintaining a G0 state and inhibiting apoptosis. As p53 regulates the DPP7/2 promoter, reduced expression is associated with cell cycle deregulation, as well as induction of c-Myc [35]. Interestingly, the inhibition of DPP7/2 induces apoptosis in resting lymphocytes but not in activated lymphocytes. To this end, DPP7/2-driven apoptosis has been shown to be a reliable prognostic factor in chronic lymphocytic leukemia (CLL), as CLL B-cells sensitive to DPP7/2 inhibition are in G0, while resistant CLL B-cells are partially activated [27]. FBXO46 has not been as thoroughly characterized as the other prognostic genes, but it has been found to be dysregulated in cancer and plays a role in the biogenesis of cancer [36].
Among the 17 genes included in this panel, the expression of YWHAB, MCM4, LRRC59 and FBXO46 was elevated in tumor tissue compared to normal tissue. Although the difference was not significant, DPP7/2 was expressed at slightly higher levels in normal tissue. This may be due to expression being limited to a subset of quiescent lymphocytes and fibroblasts. Patients with lower expression of DPP7/2 had poorer overall survival in our study. In correlation analysis, FBXO46 was highly correlated with YWHAB (Pearson χ² test, p < 0.0001). In combination, they might play a significant role in CRC tumorigenesis. In another significant correlation, DPP7/2 showed a negative correlation with LRRC59 (Pearson χ² test, p < 0.0001) and PCMT1 (Pearson χ² test, p < 0.0001). As the expression of DPP7/2 is downregulated in CRC tumor tissues, it shows an inverse correlation with PCMT1, which has been shown to be expressed at higher levels in bladder cancer [37].
The prognostic score generated in this study was also evaluated for stage-specific prognostic significance. Identification of low-risk patients in stage II and III is critical, as several studies have found that surgery alone is sufficient to cure most patients and that chemotherapy is beneficial for only a subset of patients [38]. If a novel prognostic method is developed, these low-risk patients could be spared the toxic effects and numerous sequelae of chemotherapy. Several gene expression signature-based tests are currently being validated in larger cohorts, and multiple new signatures are continuously being reported [39][40][41]. Several studies have identified single genes such as PD-L1, Layilin and Apolipoprotein E with prognostic significance in colorectal cancer [42][43][44]. Several multi-gene signatures have also been reported to divide patients on the basis of overall survival [45,46]. In this study, the approach of including genes with prognostic significance in > 6 cancers added novelty to the 17-gene panel. These novel genes can assist in a more accurate prognosis of patients, especially in stage II and stage III, which might not be as accurately defined through other gene panels. While databases such as Oncomine can be valuable tools, expression values might differ in tumor tissues for this prognostic gene signature, most likely due to the lack of survival data and clinical information. Our study attempts to find a consensus prognostic score by utilizing TPA, cBioportal, Nanostring and GEO datasets. To maximize the clinical impact in a specific stage, a recent study utilized a Random Forest analysis to identify an 8-gene signature for risk stratification in AJCC stage I [47]. Our prognostic signature significantly differentiated patients based on overall survival and maintained significance for stage II and stage III patients, who are prognostically difficult to differentiate. This stage-specific risk score generation lends specificity to prognostic scores, increasing accuracy in the clinical setting. Validation of these genes in larger cohorts, and elucidation of their colorectal cancer-specific functional and regulatory roles, remain for future work.

Data Source and Generation of 17-Gene Panel
The exploratory TCGA cohort consisted of 597 CRC patients. The extraction of 595 candidate genes for CRC was performed through The Human Protein Atlas (TPA) (https://www.proteinatlas.org) (Figures 7 and 8). The gene list was downloaded in .tsv format and was stratified on the basis of each gene's significance in OS prognosis of CRC. Next, these genes were screened for their combined prognostic significance in cBioportal (http://www.cbioportal.org) (Tables S1 and S2). cBioportal is an online database; the dataset queried here comprised mRNA expression data from the Agilent microarray platform for a colon adenocarcinoma cohort of 222 samples. Genes were queried with an mRNA expression z-score threshold value of ± 2.0. Genes from the 595 candidates that did not reach significant variable expression were removed through backward deletion, leaving 7 significantly altered genes in cBioportal (PI4K2B, PBXIP1, CHEK1, DLAT, FAM50A, KDM4B, DPP7/2) (p < 0.0001). In combination, the expression of these genes significantly differentiated CRC patients on the basis of overall survival (Figure 8). Multiple platforms such as TPA and cBioportal help in the discovery and screening of potential candidate prognostic genes before their validation on clinical samples. To expand the gene panel and to discover new prognostic genes, a novel strategy was utilized to include genes with aberrant expression in multiple cancers. For this, a total of > 10,000 genes with prognostic significance in 17 cancers were downloaded from the TPA database. Of these, 29 genes showed significant variable expression in 6 or more diverse types of cancer. These genes were queried in cBioportal for their significance in CRC, and the top 10 altered genes, ranked by the percentage of altered samples, were added to the panel (YWHAB, DSG2, PCMT1, MCM4, AGFG1, E2F1, LRRC59, SLAMF6, FBXO46, ITGA5) (Tables S2 and S3).
In the initial screening of the aforementioned 7 genes and 10 genes, we ensured that each individual gene was altered in > 5% of the cBioportal screening dataset. The role of these genes in CRC prognosis was tested using the clinical and external datasets. For external validation, the expression profile dataset of an independent human CRC study (GSE38832, n = 122) was downloaded from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo). The GSE38832 study was performed using an Affymetrix Human Genome U133 Plus 2.0 Array. The downloaded data were further curated for all relevant clinical and follow-up features. The flowchart of the entire study is depicted in Figure 9.
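The z-score screening described in this section can be sketched as follows. This is a minimal illustration and not the cBioportal implementation: the function name `altered_fraction`, the matrix layout and the toy values are assumptions, and only the ± 2.0 z-score threshold and the > 5% altered-sample criterion come from the text.

```python
import numpy as np

def altered_fraction(z_scores, threshold=2.0):
    """Fraction of samples per gene whose |z-score| exceeds the threshold.

    z_scores: 2-D array with rows = genes and columns = samples
    (a hypothetical layout chosen for this sketch).
    """
    z = np.asarray(z_scores, dtype=float)
    return (np.abs(z) > threshold).mean(axis=1)

# Toy data (invented for illustration): 3 genes x 10 samples.
z = np.array([
    [2.5, 0.1, -0.3, 0.4, -2.2, 0.0, 0.9, 1.1, -0.5, 0.2],  # altered in 2/10
    [0.1, 0.2, 0.3, -0.1, 0.0, 0.5, -0.4, 0.6, 0.2, -0.3],  # altered in 0/10
    [3.1, 2.4, -2.6, 0.2, 0.1, 2.1, 0.0, -0.2, 0.3, 0.4],   # altered in 4/10
])
fractions = altered_fraction(z)
keep = fractions > 0.05  # the > 5% altered-sample criterion from the text
```

Ranking the retained genes by `fractions` would then mirror the "top altered genes on the basis of percent altered samples" step, under the same assumptions.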

Patient Characteristics
For internal validation, formalin-fixed, paraffin-embedded (FFPE) blocks were accessed from the pathology archives at the Medical College of Georgia at Augusta University, Augusta, GA 30912, USA. Under an IRB-approved protocol (HAC # 611298), CRC patients with 5 years' follow-up were included in this study. A total of 88 patients from all 4 stages met our inclusion criteria on the basis of survival duration after diagnosis. A total of 26 patients were administered chemotherapy after surgery and 62 patients did not receive any chemotherapy. No informed consent from the patients was required, as this was a retrospective study on de-identified FFPE samples. The patients were stratified on the basis of overall survival into two groups, higher survival (patients who survived >3 years) and lower survival (patients who survived <1 year), along with American Joint Committee on Cancer (AJCC) stage (I to IV), grade, gender, age, distant metastasis, location and vital status. Only histologically confirmed cancer patients were included in this study. Samples with insufficient documentation, lack of tumor tissue in blocks, failure of RNA isolation or highly degraded RNA were excluded.

FFPE Tissue Sectioning and H&E Staining
FFPE blocks were used to produce fine sections for microscopic analysis and RNA isolation. For tissues with a rich cancerous region, only five 5 µm sections were cut, and for small tissues twenty sections were cut. H&E staining was performed using a standard protocol, and slides were examined for tumor-rich regions by a board-certified pathologist.

Quantification of mRNA Molecules Using Nanostring Platform
To quantify mRNA expression of the 17 genes, we employed a multiplex, high-throughput digital quantification instrument by Nanostring (NanoString Technologies Inc., Seattle, WA, USA). Additionally, 6 control genes were quantified for normalization of gene expression. A total of 300 ng of total RNA was used as input for this analysis. Nanostring's nCounter PlexSet technology is a digital quantification system that quantifies RNA molecules in a highly specific manner using target-specific oligonucleotide probe pairs. PlexSet contains uniquely coded fluorescent barcodes that are linked to reporter tags and a biotinylated universal capture tag. The reporter tags emit a unique signature fluorescence that is individually resolved and counted during data capture and analysis. The universal capture tag, in turn, anchors specific RNA molecules to the streptavidin-coated lane on the nCounter instrument [48]. The Nanostring assay was performed as per the manufacturer's instructions. Data collection, which involves detection, resolution and quantification of individual fluorescent barcodes, was performed later on a separate instrument, the nCounter Digital Analyzer (DA). The DA was set at 280 fields of view (FOV), as previously noted [49].

mRNA Expression Data Normalization
The raw gene expression counts were processed and normalized according to the manufacturer's recommendations (NanoString Technologies Inc., Seattle, WA, USA). The geometric mean of the negative and positive controls was used for the first normalization of the data. A second normalization was then performed using 6 internal control genes (ABCF1, GUSB, HPRT1, LDHA, POLR1B, RPLP0). The normalizations were performed using the nCounter software (NanoString Technologies Inc., Seattle, WA, USA).
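The two-step scaling described above can be illustrated with a generic geometric-mean normalization. This is a sketch under stated assumptions and does not reproduce the nCounter software: the helper names `geometric_mean` and `scale_by_reference`, the probe layout and all counts are invented, and background handling with negative controls is omitted.

```python
import numpy as np

def geometric_mean(values, axis=0):
    """Geometric mean along an axis (all values must be positive)."""
    return np.exp(np.mean(np.log(values), axis=axis))

def scale_by_reference(counts, reference_rows):
    """Scale each sample (column) so that the geometric mean of its
    reference probes matches the average of those geometric means
    across all samples."""
    ref_gm = geometric_mean(counts[reference_rows, :], axis=0)  # one per sample
    factors = ref_gm.mean() / ref_gm
    return counts * factors

# Toy count matrix (invented): rows = probes, columns = 2 samples.
# Rows 0-1: positive controls, rows 2-3: housekeeping genes, row 4: target gene.
counts = np.array([
    [100.0, 200.0],
    [400.0, 800.0],
    [50.0, 100.0],
    [200.0, 400.0],
    [30.0, 90.0],
])
step1 = scale_by_reference(counts, [0, 1])  # positive-control scaling
step2 = scale_by_reference(step1, [2, 3])   # housekeeping-gene scaling
```

In this toy case the second sample was loaded at twice the depth of the first; after scaling, the target gene counts (45.0 and 67.5) differ by a factor of 1.5 rather than the raw factor of 3.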

Correlation Analysis and Gene Expression Comparison with Normal Tissue
To assess correlation among the genes, cluster analysis of the 17 genes was performed on the basis of the Spearman correlation coefficient. For comparison of normal and tumor tissue expression, the Gene Expression Profiling Interactive Analysis (GEPIA) database (http://gepia.cancer-pku.cn) was utilized. In GEPIA, the COAD tumor dataset (n = 275) was compared against combined gene expression data of normal tissues from TCGA and Genotype-Tissue Expression (GTEx) (n = 349). Standard GEPIA parameters were used, with a log2FC cutoff of 1 and a p-value cutoff of 0.01.
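For readers unfamiliar with the Spearman coefficient, the sketch below computes it as the Pearson correlation of ranks. The helper names and expression values are invented for illustration, and this is not the software used in the study.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

# Invented expression values for three hypothetical genes in five samples.
gene_a = [1, 2, 3, 4, 5]
gene_b = [2, 4, 5, 9, 20]   # rises with gene_a, but not linearly
gene_c = [10, 8, 6, 3, 1]   # falls as gene_a rises
rho_ab = spearman(gene_a, gene_b)
rho_ac = spearman(gene_a, gene_c)
```

Because only ranks matter, `gene_b` is perfectly correlated with `gene_a` despite the non-linear relationship, which is why a rank-based coefficient suits expression counts with skewed distributions.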

Construction and Validation of a 4 Gene Prognostic Model
The prognostic score was generated using the Cox proportional hazards regression coefficient for each gene. For every patient, the prognostic score was calculated by multiplying the expression value of each gene by its corresponding Cox regression coefficient and summing over the genes (Prognostic score = $\sum_{i} \beta_i x_i$, where $\beta_i$ is the Cox regression coefficient of gene $i$ and $x_i$ is the expression value of gene $i$). Separate coefficients were calculated for the internal and external datasets. The resulting prognostic score was used to divide patients into two categories, high score and low score, based on a median cut-off threshold. These categories were utilized to differentiate patients in stage II and stage III from the internal and external datasets. KM analysis was performed to assess the utility of this model in differentiating these groups.
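The score construction described above can be sketched in a few lines. The coefficients below are placeholders and not the fitted values from this study, the expression values are invented, and assigning patients exactly at the median to the low group is an assumption, since the text specifies only a median cut-off.

```python
import numpy as np

def prognostic_scores(expression, coefficients):
    """Weighted sum of expression values: score_j = sum_i beta_i * x_ij.

    expression: array of shape (patients, genes)
    coefficients: array of shape (genes,), one Cox coefficient per gene
    """
    return np.asarray(expression, dtype=float) @ np.asarray(coefficients, dtype=float)

def split_at_median(scores):
    """Label patients 'high' if the score is above the median, else 'low'."""
    cutoff = np.median(scores)
    return np.where(scores > cutoff, "high", "low")

genes = ["YWHAB", "MCM4", "FBXO46", "DPP7/2"]
# Placeholder coefficients, NOT the fitted values from the study.
beta = np.array([0.5, 1.0, 0.5, -1.0])
# Invented expression values for four patients (rows) across the four genes.
expr = np.array([
    [2.0, 1.0, 2.0, 3.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [3.0, 2.0, 4.0, 2.0],
])
scores = prognostic_scores(expr, beta)  # [0.0, 5.0, 1.0, 3.5]
groups = split_at_median(scores)        # median is 2.25
```

A negative coefficient, as used here for DPP7/2 purely for illustration, lowers the score as expression rises, which is how a gene whose higher expression accompanies better survival would enter such a sum.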

Statistical Analysis
The continuous variables in this study, including Nanostring expression counts, are shown as the mean ± SE. The median of the normalized counts was used to divide patients into two groups: individuals with higher expression and individuals with lower expression. The relationship between the gene expression of these groups and the categorical clinicopathological parameters was assessed using the Pearson χ² test. The univariate and multivariate analyses of the different genes were performed using Cox proportional hazards regression. The hazard ratio and 95% confidence interval values were also derived from the Cox proportional hazards model. The Kaplan-Meier method was used to analyze survival, and the log-rank test was used to compare the survival distributions. All p-values were two-sided, and p < 0.05 was defined as statistically significant. Additionally, ROC (receiver operating characteristic) analysis was performed for the 4-gene signature on both the external and clinical datasets. The statistical analyses were conducted using JMP-Pro (version 14.0.0, SAS Institute, Cary, USA) and GraphPad Prism (version 8, GraphPad Software, La Jolla, California, USA).
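As an illustration of the Kaplan-Meier method named above, the following sketch computes the product-limit survival estimate in plain Python. The analyses in this study were run in JMP-Pro and GraphPad Prism; the function name `kaplan_meier` and the follow-up data here are invented for illustration.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times: follow-up durations; events: 1 = death observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Invented follow-up data (months) for six patients.
times = [5, 8, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Censored patients leave the risk set without lowering the curve, which is what distinguishes this estimate from a simple proportion of survivors.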

Conclusions
In summary, our study developed a novel four-gene prognostic model that was used to predict clinical outcomes in CRC patients. Our approach of first identifying risk genes from TCGA datasets and then validating them experimentally can be equally insightful in other cancers. Additional research is required to assess the functional role of these genes in colorectal tumors. We are in the process of validating this study on a larger cohort and independent datasets. Efforts to develop similar gene signatures promise to equip clinicians with better information for adopting novel personalized interventions for higher-risk patients.