Identification of 42 Genes Linked to Stage II Colorectal Cancer Metastatic Relapse

Colorectal cancer (CRC) is one of the leading causes of cancer mortality. Metastasis remains the primary cause of CRC death. Predicting the possibility of metastatic relapse in early-stage CRC is of paramount importance to target therapy for patients who really need it and spare those with low-potential of metastasis. Ninety-six stage II CRC cases were stratified using high-resolution array comparative genomic hybridization (aCGH) data based on a predictive survival algorithm and supervised clustering. All genes included within the resultant copy number aberrations were each interrogated independently at mRNA level using CRC expression datasets available from public repositories, which included 1820 colon cancers, and 167 normal colon tissues. Reduced mRNA expression driven by copy number losses and increased expression driven by copy number gains revealed 42 altered transcripts (29 reduced and 13 increased transcripts) associated with metastatic relapse, short disease-free or overall survival, and/or epithelial to mesenchymal transition (EMT). Resultant genes were classified based on gene ontology (GO), which identified four functional enrichment groups involved in growth regulation, genomic integrity, metabolism, and signal transduction pathways. The identified 42 genes may be useful for predicting metastatic relapse in stage II CRC. Further studies are necessary to validate these findings.


Introduction
Microarray comparative genomic hybridization (aCGH) has been extensively used to profile colorectal cancer (CRC) for copy number aberrations. However, their direct relevance to prognosis and therapeutic interventions has remained elusive [1,2]. Undoubtedly, early cancer detection contributes to a better patient outcome. Fecal occult blood testing, fecal immunochemical/molecular testing, and colonoscopy-based screening have become wide-spread in developed countries as a preventive measure for the detection of CRC in its early stages. However, 10%-30% of early-stage CRC still relapse with metastases within two years of surgical treatment [3,4]. To address this critical issue, various technologies such as aCGH, metaphase, and next-generation sequencing were employed in attempts targeted towards the identification of suitable prognostic biomarkers in early stage CRC. Array-based analysis of DNA copy number aberrations (CNAs), in particular, has proven to be very effective in identifying recurrent CNAs in CRC. It is usually argued that CNAs of recurrent nature may imply their evolutionary importance in driving CRC progression [5,6]. For this reason, large bodies of publications have attempted to ascertain early stage CRC patients at risk of metastatic relapse based on the genomic profiling of primary tumors. However, such attempts rarely identified reliable and replicable CNAs or genes predictive of survival. Several predictive gene signatures have been reported for CRC [7,8]. Yet, these studies suffer limitations such as small sample size and lack of adequate expression-based validation. This is not surprising given that most CNAs encompass large genomic aberrations that usually harbor hundreds of bystander genes among which only few genes may be potentially useful for prognostication [8]. Therefore, validation studies linking genomic aberrations at the DNA level to transcription have been limited [7]. The two questions that need to be addressed are; first, how can we identify key "driver" CNA events from the enormous amount of random "passenger" CNAs that have no functional significance? Second, how can we pinpoint the few metastasis suppressor genes residing within large areas of genomic deletions? Here, we attempted to address these questions. Firstly, we identified CNAs exclusively found in stage II metastatic CRC using stringent experimental and statistical algorithms. We then grouped the CNAs generated based on their significant effects on disease survival. The resultant list of potential metastasis-associated genes (958 genes) residing within CNA regions were each independently validated for aberrant expression in publicly available CRC cohorts' expression data. This validation identified 42 genes, potentially involved in metastasis. Twenty-nine metastasis suppressor genes whose copy number deletions resulted in diminished transcription, and reduced overall survival (OS) or disease-free survival (DFS), and 13 potentially metastasis-enhancing genes whose copy number gains and overexpression associated with metastatic relapse, short disease-free or overall survival or/and epithelial to mesenchymal transition (EMT). A large and randomized study focused on the 42 genes to confirm our data and validate their clinical utility is now warranted. Table 1 illustrates the clinicopathological characteristics of 96 stage II CRC cases investigated in our study. The data show that pathological classification alone has limited efficacy in distinguishing subclasses of stage II CRC in terms of aggressiveness or prognosis. To refine our search for genes involved in metastasis, we excluded seven cases that later presented with local recurrence. Significance testing for aberrant copy number (STAC) analysis focused on chromosomal CNAs of 89 CRC cases, which produced acceptable derivative of log ratio spread (DLRS) values below 0.5. DLRS is defined as the spread of the log ratio differences between consecutive probes along all chromosomes. Comparison using stringent criteria between CNAs found in 11 stage II CRC with metastatic disease and CNAs found in 78 stage II CRC cases that remained disease-free resulted in the identification of chromosomal CNAs significantly enriched in metastatic CRC (Figure 1a). The most frequent copy number gains (CNGs) were in chromosomes seven (56%), 8q (56%), 13q11-q34 (61%), and 20q11.1-q13.33 (79%). Other less frequent regional gains include; chromosomes 1q, 9p, and 17q. The most frequent copy number losses (CNLs) were in chromosome arms 1p (71%), 8p (72%), 17p (55%), 22q (60%), and chromosomes 14 (77%), 15 (66%), and 18 (80%). To narrow down genes with predictive potential, a second analytical approach that is dependent solely on survival data and not whether the case relapsed with metastases or not was utilized to point out CNAs shared by CRC cases with reduced DFS (Figure 1b). The two approaches were compared to compile a common CNA profile that may influence metastasis and DFS (Figure 1c). The most significant chromosomal gains that are consistently associated with metastatic CRC recurrence and poor survival were gains in chromosomes 1q, 9p, 13q11-q34, 17q, and 20q11.1-q13.33; and losses in chromosomes 5q12.1-q35.3, 8p12-p23.3, 9q33.1-q34.3, 11q23.3-q25, 14q11.2-q32.31, 15q21.1-q26.3, 18p11.31-q21.1, 20p12.1-p13, and 22q11.21-q13.31. There were 1099 genes located within significant CNAs that were further scrutinized for their existence in normal healthy individuals [9,10]. CNAs frequent in healthy normal genomes that overlapped 5%-100% with mapped CRC metastatic vs. non-metastatic chromosomal gains were excluded from further analysis. Interestingly, all CNAs, except for 8p chromosomal region harboring SPAG11A, had minimal overlap with common CNAs in normal healthy individuals. This elimination step resulted in 958 candidate genes for further investigation (Table S1). While some of the CNAs housed one to a few genes, larger CNAs contained many genes that may modify cancer progression or are more likely "innocent passenger" CNAs that are of no functional significance. In addition, CNA profiles segregated according to our cohort's clinicopathological characteristics also resulted in significant clustering into identifiable groups. Distinct CNAs were found between microsatellite stable (MSS) and microsatellite instable (MSI) CRC, and other clinicopathological characteristics (manuscripts submitted elsewhere). However, we chose to focus here on metastasis associated CNAs. Consequently, the expression level of each of the 958 genes resultant from metastatic CRC CNAs profile was interrogated independently using publicly available CRC datasets to validate their clinical relevance based on; their differential expression in normal and CRC samples, association with grade, microsatellite instability, stage, DFS, overall survival (OS), and EMT score. Copy number aberrations in CRC stage II using the first analytical approach STAC (A), which compares the chromosomal copy numbers found in metastatic CRC with non-metastatic CRC; and the second analytical approach; the survival predictive algorithm (B), which identifies chromosomal copy number gains associated with reduced disease-free survival; (C) shows the frequencies of the aberrations in the two approaches and identifies CNA common to both methods (100%). Each column represents a chromosome with blue bars indicating copy number gains and red bars copy number losses.

Expression Analysis of Candidate Oncogenes
Approximately 307 genes were not represented on the Affymetrix U133Plus2 platform (Santa Clara, CA, USA), of which 166 genes had probes on U133A platform. Genes without a probe on either platform were mostly miRNA genes and were excluded. At first, two categorical variables were considered, DFS and OS to select CNAs' genes of altered expression. This was done to isolate genes of strong prognostic value. Interrogating gene expression changes driven by CNAs revealed 42 altered transcripts (29 reduced and 13 increased) associated with metastatic relapse, short DFS, or OS ( Table 2). Table 2. Cox regression for survival analysis of the 42 genes' mRNA expression in meta-cohort (Affymetrix U133A, n = 1820; or U133Plus2, n = 1436) for overall survival (OS), and disease-free survival (DFS). A negative regression coefficient implies a better prognosis for patients if they retain the function of a given gene (worse prognosis when the cancer underexpresses the gene). Conversely, a positive regression coefficient means that the hazard is higher for a given gene's overexpression and, thus, the prognosis worse.  Figure 1. Copy number aberrations in CRC stage II using the first analytical approach STAC (A), which compares the chromosomal copy numbers found in metastatic CRC with non-metastatic CRC; and the second analytical approach; the survival predictive algorithm (B), which identifies chromosomal copy number gains associated with reduced disease-free survival; (C) shows the frequencies of the aberrations in the two approaches and identifies CNA common to both methods (100%). Each column represents a chromosome with blue bars indicating copy number gains and red bars copy number losses.

Expression Analysis of Candidate Oncogenes
Approximately 307 genes were not represented on the Affymetrix U133Plus2 platform (Santa Clara, CA, USA), of which 166 genes had probes on U133A platform. Genes without a probe on either platform were mostly miRNA genes and were excluded. At first, two categorical variables were considered, DFS and OS to select CNAs' genes of altered expression. This was done to isolate genes of strong prognostic value. Interrogating gene expression changes driven by CNAs revealed 42 altered transcripts (29 reduced and 13 increased) associated with metastatic relapse, short DFS, or OS (Table 2). Table 2. Cox regression for survival analysis of the 42 genes' mRNA expression in meta-cohort (Affymetrix U133A, n = 1820; or U133Plus2, n = 1436) for overall survival (OS), and disease-free survival (DFS). A negative regression coefficient implies a better prognosis for patients if they retain the function of a given gene (worse prognosis when the cancer underexpresses the gene). Conversely, a positive regression coefficient means that the hazard is higher for a given gene's overexpression and, thus, the prognosis worse.  Our findings show that several candidate oncogenes of significant prognostic potential reside within regions of chromosomal gains and are overexpressed, most likely, under the influence of gene amplification within these regions. Moreover, our results pinpointed metastasis-suppressor genes of significant prognostic potential residing within chromosomal loss areas, and are underexpressed due to genetic deletions. To further refine our list for genes with the greatest prognostic power, we selected genes significantly associated with DFS or OS using a stringent criterion (p-value < 0.005), and compared their expression levels in tumor against normal colon tissues. A total of 19 genes had highly significant associations with DFS or OS, of which only 14 had significantly differential expression in normal colon vs. CRC tissues ( Figure S1). Therefore, these genes have the greatest discriminatory application in clinical prognostication of stage II CRCs at risk of metastatic relapse. We next focused on evaluating the combined potential of a set of deleted genes in relation to their combined effects on DFS or OS. Our hypothesis was that double or more deletions of candidate tumor suppressor genes will significantly increase their effect on DFS and OS in CRC patients. Combination analysis results show that while single deleted gene reduced expression associated with a significant reduction in DFS or OS, their combined underexpression analysis in CRC strengthened their prognostic potential in association with reduced DFS/OS ( Figure 2).
Lastly, we analyzed each of the 42 genes' expression levels in relation to the clinicopathological characteristics of analyzed CRC samples from the GEO database, which included; tumor grade, cancer stage (stages I-IV), expression in MSS vs. MSI tumors, expression levels in tumors vs. normal colon tissues, and their relationship to EMT in cancers. GEO database clinicopathological correlation data for all of the 29 potentially metastasis-suppressor genes and the 13 potentially metastasis-enhancer genes are shown in (Tables S2-S9).

Discussion
CRC represents a heterogeneous group of gastrointestinal neoplasia. Stage II CRC is the most clinically challenging stage as it is the critical point between remission and progression into aggressive stages. Stage II CRC has 20%-30% risk of metastasis, whereas stage III CRC has a 50%-80% risk of distant metastasis [11]. The genetic complexity of stage II CRC is reflected by its extensive CNA profile that presents an additional limitation of sifting the "culprit" genes from the

Discussion
CRC represents a heterogeneous group of gastrointestinal neoplasia. Stage II CRC is the most clinically challenging stage as it is the critical point between remission and progression into aggressive stages. Stage II CRC has 20%-30% risk of metastasis, whereas stage III CRC has a 50%-80% risk of distant metastasis [11]. The genetic complexity of stage II CRC is reflected by its extensive CNA profile that presents an additional limitation of sifting the "culprit" genes from the "innocent passenger" ones. The ongoing theory is that cancer metastases arise due to the acquisition of the metastatic phenotype, which appears to have an underlying metastatic genotype [12,13]. We generated genomic CNA profiles for stage II CRCs, with and without subsequent metastatic disease to identify CNAs associated with metastasis or reduced OS/DFS in stage II CRC. We then utilized publicly-available high-density oligonucleotide-based microarray expression data to study the effect of these CNAs. The use of well-defined expression profiles from sets of colorectal cancers may serve as reliable and focused validation cohorts. Our resultant genes were individually scrutinized for association with cancer biology. In addition, a stringent gene selection method was employed to exclude false positive CNAs that occur in normal colon samples, and exclude genes whose expression levels in CRC were comparable to normal colon tissues. All 42 genes' expression had significant association in predicting OS and/or DFS ( Table 2). Functional annotation of each gene confirms that each of these genes partake in key processes of cancer progression and metastasis, providing insights into the genetic mechanisms driving metastatic CRC. The 42 genes were divided into genes deleted (29 genes) and genes amplified (13 genes) of which 26 deleted genes and 13 amplified genes had specific known functions. Each gene was literature-researched for reported evidence of known biological function(s) and involvement in cancer.
In the deleted genes list several genes had known tumor suppressor and anti-metastatic properties (Table S10). CSMD1, FAM83F, and EPHX2 are known to be silenced, mutated, or deleted in gastrointestinal cancers though their precise functions remain unknown [14,15]. ADRA1A, APOBEC3D, CABIN1, DIO3, GP1BB, HNF6, MCM8, PCDHGA11, WDR5, ZNRF3, and ZNF366 all have functional attributes that are known to control and affect the transition into metastatic pathways. However, CABIN1 did not significantly associate with our EMT potential analysis (p = 0.8), and associated with distinguishing MSS CRC (p < 0.05, Tables S4 and S5). CABIN1 is a pro-apoptotic gene, which suggests that its reduced expression is an early marker for CRC's proliferation and genomic instability [16]. EMT fundamentally involves genomic and phenotypic changes of epithelial cells into mesenchymal cells, loss of epithelial cell-cell contact, loss of cell polarity, and acquisition of migratory characteristics facilitating metastasis [17]. ADRA1A and ZNF366 have proliferation regulatory functions that limit tumor cells from forming tumors/metastases in response to stimulators [18,19]. Genomic instability and loss of epithelial genomic signatures are also a requirement for EMT. Three deleted signature genes; APOBEC3D, MCM8 and WDR5 have genomic stability, genomic fidelity, and epigenetic fidelity functions, respectively [20][21][22]. Downregulation of cell-cell contact to facilitate EMT and tumor mass blood transport is achieved by deletion of epithelial cell adhesion regulatory genes that include GP1BB and PCDHGA11 in our resultant deleted genes list [23,24]. DIO3 is a metabolic suppressor, which has cancer growth rate-limiting properties and may mitigate EMT [25]. Lastly, ZNRF3 and HNF6 are EMT and cell-differentiation regulators under normal conditions and are tumor suppressor genes inhibiting the EMT pathway [26,27]. Four deleted genes had contradictory functions to EMT inhibition but had a negative EMT score based on our analysis. These genes associated with other characteristics of CRC based on their expression analysis. ADRA1D was differentially expressed in stage IV compared to stages II and III. NAT1 and NAT2, drug metabolism genes; whose expression was associated with stage I MSS CRC. BRF2, a pro-proliferation gene, had a positive EMT score and its expression was associated with MSS CRC. Activation of EMT was shown to reduce cell proliferation by targeting specific pro-proliferation genes [28,29]. It is possible that the loss of BRF2 functions in the primary tumor would result in the activation of alternative proliferation pathways in distant metastases during reversion of EMT. The remaining nine deleted genes were of unknown functions, albeit associating with several characteristics of CRC (Tables S2-S5), which warrants further functional studies of these genes.
All amplified genes had known functions, except for ANXA2P2, which is a pseudogene for the mesenchymal marker Annexin A2. Most amplified genes had positive association with EMT except for SEMG1, SMU1 and ING1. SMU1 and ING1 regulate genomic integrity and cell growth and their function could be altered by aberrant expression or gene structural changes that resulted in their association with MSS CRC [30,31]. Similarly, SEMG1 expression, a negative regulator of calcium import, is associated with MSS CRC. Metabolic advantage is a feature of metastatic tumors and might be mediated by overexpression of VLDLR, USP32, and PITPINC1 [32][33][34]. Although PITPINC1 had a non-significant EMT score, its expression was associated with MSS CRC. Five amplified genes were involved in multiple signaling cascades that can have pleiotropic effects. Those genes were DOK5, DUSP14, GLIS3, PITPNC1, and SEMG1. These genes are partly characterized for their involvement in cancer and would require further studies. Two amplified genes had contradictory functions to being overexpressed in metastatic cancer, SCEL (a keratinocyte differentiation factor), and MPDZ (a cell adhesion factor). However, both genes are involved in recruitment and assembly of proteins to cellular surfaces promoting protein-protein interactions. Aberrant expression of these proteins may be involved in the formation of invadopodia, which facilitates invasion of the local extracellular matrix and basement membranes [35,36]. Lastly, it should be noted that our study here is limited to estimating copy number aberrations; it does not discern sequence mutations or epigenetic changes.
In conclusion, our study characterized the personal nature of stage II CRC CNAs highlighting key genes involved in promoting CRC metastasis. Resultant genes were mostly involved in cell growth regulation, metabolic regulation, and signaling pathways (Figure 3). We have identified some novel candidate genes of possible prognostic value for both overall survival and DFS. As well as novel genes that can be used for CRC classification. These genes can offer new targets for diagnostic and therapeutic designs. We anticipate validating our findings in separate retrospective and prospective studies for their efficiency in predicting metastatic stage II CRC. import, is associated with MSS CRC. Metabolic advantage is a feature of metastatic tumors and might be mediated by overexpression of VLDLR, USP32, and PITPINC1 [32][33][34]. Although PITPINC1 had a non-significant EMT score, its expression was associated with MSS CRC. Five amplified genes were involved in multiple signaling cascades that can have pleiotropic effects. Those genes were DOK5, DUSP14, GLIS3, PITPNC1, and SEMG1. These genes are partly characterized for their involvement in cancer and would require further studies. Two amplified genes had contradictory functions to being overexpressed in metastatic cancer, SCEL (a keratinocyte differentiation factor), and MPDZ (a cell adhesion factor). However, both genes are involved in recruitment and assembly of proteins to cellular surfaces promoting protein-protein interactions. Aberrant expression of these proteins may be involved in the formation of invadopodia, which facilitates invasion of the local extracellular matrix and basement membranes [35,36]. Lastly, it should be noted that our study here is limited to estimating copy number aberrations; it does not discern sequence mutations or epigenetic changes.
In conclusion, our study characterized the personal nature of stage II CRC CNAs highlighting key genes involved in promoting CRC metastasis. Resultant genes were mostly involved in cell growth regulation, metabolic regulation, and signaling pathways ( Figure 3). We have identified some novel candidate genes of possible prognostic value for both overall survival and DFS. As well as novel genes that can be used for CRC classification. These genes can offer new targets for diagnostic and therapeutic designs. We anticipate validating our findings in separate retrospective and prospective studies for their efficiency in predicting metastatic stage II CRC.

CRC Samples
DNA extracted from formalin-fixed paraffin-embedded (FFPE) tissues from 96 patients with sporadic early stage II CRC were used for genomic profiling. Genomic DNA was isolated from microdissected FFPE CRC tissues as described previously [37].

Microsatellite Instability Analysis
Microsatellite fragment analysis was performed on FFPE-extracted DNA using MSI Analysis System Version 1.2 kit (Promega, Madison, WI, USA). Spectral calibration was carried out on the Applied Biosystems 3130 genetic analyzer using the Powerplex Matrix Standards 3100/3130 kit (Promega). Promega's MSI Analysis System includes fluorescently labeled primers for amplification of seven markers; five mononucleotide repeat markers (Bat-25, BAT-26, NR-21, NR-24, and MONO-27), and two pentanucleotide repeat markers (Penta C and D). Mononucleotide markers are used to determine the MSI status, whereas pentanucleotide markers are used to detect potential sample mix-up by validating that tumor and matching normal samples are from the same individual. DNA concentrations of 10-20 ng from normal and tumor samples were used for the fluorescent PCR-based assay. Microsatellite instability was determined by comparing allelic profiles of microsatellite markers generated by amplification of normal and tumor DNA. Internal lane size standard ILS600 was added to amplified samples to ensure accurate sizing of alleles. A loading cocktail was prepared by mixing the ILS600-PCR product with highly-deionized formamide, and denatured prior to loading onto the 3130 Genetic Analyzer for capillary electrophoresis. The sample's fragment separation output data were analyzed using GeneMapper software version 4.0 (Applied Biosystems, Foster City, CA, USA). CRC samples were classified as MSS if no marker showed any length variation compared with its matching normal colon mucosa. When two or more of the markers showed length mutation in CRC compared with its matching normal mucosa, the CRC sample was labeled as MSI-high.

Genomic Landscaping Using aCGH
Array CGH was carried out on 96 CRC samples following our standard published protocol [38]. In summary, 2 µg of tumor DNA and sex-matched pooled reference DNA (Promega, Madison, WI, USA) were sonicated in a water bath. The universal linkage system (ULS) Cy3 and Cy5 (Agilent Technologies, Santa Clara, CA, USA) dyes were used to label DNA according to the manufacturer's protocol. Differentially-labeled DNA was purified by Agilent KREApure columns (Agilent Technologies, Santa Clara, CA, USA). Purified labeled tumor and reference samples were hybridized onto Human Genome CGH arrays 244A slides (Agilent Technologies, Santa Clara, CA, USA) in SureHyb chambers (Agilent Technologies, Santa Clara, CA, USA) for 40 h at 60˝C. Washing and scanning were carried out according to manufacturer's protocol. Slides were scanned immediately on an Agilent microarray scanner at 5 µm resolution to minimize the impact of environmental factors on signal intensities. Data was extracted from microarray image files using Feature Extraction software (Version 9.5, Agilent Technologies, Santa Clara, CA, USA) and analyzed using Nexus Copy Number™ software (BioDiscovery, El Segundo, CA, USA). Quality values generated from data analysis ranged between 0.05-0.4, and to minimize false positive calls and random copy number variations BioDiscovery's Fast Adaptive State Segmentation Technique (FASST2) algorithm with a stringent significance threshold of 5.0ˆ10´6 was used to determine copy number states.

Data Clustering and Statistical Analysis
Traditional means of classifying the importance of cancer-related copy number gain or loss include the frequency of their occurrence in different patients. However, cancer genomes are highly complex and frequently harbor random "passenger" copy number aberrations of no functional significance.
To minimize interference from these passenger events, a systematic statistical approach termed significance testing for aberrant copy number (STAC) was employed [39]. The STAC-based algorithm is a robust method, which identifies a set of aberrations that are stacked on top of each other from different samples such that it would not occur by chance. To find these events, aberrations (usually narrow regions) were permutated in each arm of each chromosome and the likelihood for an event occurring at any location at a particular frequency was calibrated. Here, we used STAC to compare CNAs from 11 stage II primary CRC cases with confirmed distant metastases based on a 10-year follow up, and 78 stage II primary CRC relapse-free cases (Table 1). We defined metastatic CRC as distant cancer recurrence away from the primary site. Disease/relapse-free status is defined as no distant metastases (liver, brain, or bone) or local recurrence. We set the likelihood frequency stringently at 35% and a significance p-value of <0.05 [39]. To further refine the CNA to include clinically-relevant aberrations, we reanalyzed the data using a survival predictive power statistical approach (SPPS). SPPS correlates CNA with survival time. In this analysis samples with similar CNA regions are grouped (In-Group) and their mean survival calculated and compared to samples without the named CNA (Out-Group). The mean survival times were then compared using the log rank test. This approach highlights CNAs that influence survival time and may be more clinically relevant. Such permutations are not considered in the STAC analysis we used; therefore, the two methods may be considered complementary. Results from both approaches were compared to narrow down the chromosomal regions and resident genes associated with CRC metastasis and disease-free status.

Estimation of Epithelial-Mesenchymal Transition Score
Derivation of EMT signature and estimation of EMT score were described previously [41]. Briefly, a curated EMT gene set [42], after removing basal keratins genes [43,44], was used as an initial selection of epithelial and mesenchymal cell lines. The first step was to establish an EMT signature using binary regression (BinReg) 2.0 [45] comparing profiles of colon cell lines with low or high EMT enrichment score. Briefly, BinReg uses a Bayesian statistical analysis to fit a binary probit regression model on training data given a set of genes that are most correlated with the EMT phenotype. The regression coefficients of these genes indicate the discriminating power of the gene and are weights for the overall meta-gene profile. The overall meta-gene profile is subsequently used for comparison and predicts the status of the phenotype of the new sample or dataset. In the second step, the BinReg colon cancer EMT signature was applied to predict the EMT status of colon cancer tumor or cell lines. In the third step the extreme 25% samples with the highest probabilities for epithelial or mesenchymal phenotype were used to obtain the epithelial or mesenchymal specific gene list for the colon cancer tumors or cell lines using Significance Analysis of Microarray (SAM) q-value = 0 and receiver-operating characteristics curve (ROC) value of 0.85. SAM computed the expression fold change of the genes based on the EMT phenotype, and assessed the significance by comparing the observed fold change against the background expression fold change estimated from 1000 permutations of the samples' EMT phenotypes. On the other hand, ROC method constructs a ROC curve assessing the predictive power of a gene with respect to the EMT phenotypes. In the fourth step, single-sample GSEA (ssGSEA) was employed to compute the enrichment score of a tumor or cell line based on the expression of the colon cancer tumor-or cell line-specific epithelial or mesenchymal signature genes [46]. In ssGSEA, a Kolmogorov-Smirnov-based method, the enrichment of a signature is estimated by comparing the empirical cumulative distribution function of genes defined by the signature, vs. the genes not in the signature. EMT score is defined as the normalized subtraction of the mesenchymal from epithelial enrichment scores. The EMT score is an estimate for the cell line status as epithelial or mesenchymal phenotype with´1.0 indicates fully epithelial and +1.0 fully mesenchymal.

Conflicts of Interest:
The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.