Alterations in Gene Pair Correlations as Potential Diagnostic Markers for Colon Cancer

Colorectal cancer (CRC) is a leading cause of death from cancer in Canada. Early detection of CRC remains crucial in managing disease prognosis and improving patient survival. It can also facilitate prevention, screening, and treatment before the disease progresses to a chronic stage. In this study, we developed a strategy for identifying colon cancer biomarkers from both gene expression and gene pair correlation. Using the RNA-Seq dataset TCGA-COAD, a panel of 71 genes, including the 20 most upregulated genes, 20 most downregulated genes and 31 genes involved in the most significantly altered gene pairs, were selected as potential biomarkers for colon cancer. This signature set of genes could be used for early diagnosis. Furthermore, this strategy could be applied to other types of cancer.


Introduction
Colorectal cancer (CRC), which includes both colon cancer and rectal cancer, is the third most common type of cancer in Canada [1]. According to the Canadian Cancer Society, about 24,300 Canadians were diagnosed with CRC in 2022 (~10% of all new cancer cases) [2]. Globally, it has been estimated that there could be more than 2.2 million new cases and 1.1 million deaths from CRC by the year 2030 [3]. A significant number of CRC cases are sporadic, and their etiology has been linked to lifestyle and environmental factors such as age, ethnicity, sex, diet, morbidity (diabetes or obesity), and tobacco usage [3]. However, 2-8% of CRC cased are due to inherited syndromes [3].
Current treatment options for CRC generally include surgery (endoscopic and local excision) and chemotherapy. Loco-regional surgery, local ablative therapy, targeted therapy, and immunotherapy are also applied for advanced stage and/or metastatic CRC [4]. CRC normally becomes symptomatic in the advanced stage. The five-year survival rate drops from 92% for stage I to 12% for stage IV in colon cancer patients and from 88% for stage I to 13% for stage IV in rectal cancer patients [5]. Thus, early diagnosis is essential for patient survival in CRC. It has also been shown that only about 55.2% of Canadians aged 50-74 have undergone a stool test in the last two years [6] and 62% of Americans aged 50-75 have taken a colonoscopy/sigmoidoscopy in the last 10 years [7]. Although regular colonoscopy has led to a decline in CRC in adults over 50 years old, there is an indication that the CRC rate is increasing among adults less than 50 years old in both Canada [8,9] and the USA [10,11]. A recent study characterized CRC outcomes in young adults and reported that the five-year survival rate was lower in young patients in comparison to middle-aged patients; this has been attributed to advanced tumor stage and distant metastasis at the time of presentation [12].
Up to now, there is no single treatment to treat every patient with equal efficacy [13]. While the development of an effective primary medical intervention is in process, the parallel development of cost-effective, non-invasive early detection techniques (asymptomatic stage) is also under investigation as a preventive strategy in CRC patients. Although extensive genomic and transcriptomic studies have been undertaken to identify differentially expressed genes (DEG) as potential cancer biomarkers [14,15], they have not been significantly impactful. This is because CRC is highly heterogeneous and complex and is dynamically governed by a combination of vast genes and environmental factors [16]. To better characterize CRC, gene expression has to be analyzed as a system rather than an individual event. Herein, we propose that alterations in the correlations for expression of gene pairs could serve as a supplement to the current clinical biomarkers for early disease diagnosis. This novel approach integrates multiple factors to shortlist pharmacologically relevant targets using a set of indirect and network-dependent features that can be modulated in a differential fashion to bring out the required phenotypic effect and help in disease management.

Differentially Expressed Genes (DEGs)
Gene expression data were extracted from The Cancer Genome Atlas (TCGA) dataset TCGA-COAD. The dataset contains RNA-Seq expression data and the corresponding clinical data for 471 colon cancer patients and 41 healthy controls. Using |log 2 FC| ≥ 2.50 (FC, fold change) as the cutoff, we identified 606 upregulated DEGs and 447 downregulated DEGs (Supplementary Table S1). The top 20 upregulated and downregulated DEGs are summarized in Table 1.

Pathway Analysis
The upregulated and downregulated DEGs (|log 2 FC| ≥ 2.50) were subjected to pathway analysis using InnateDB [17], which can categorize proteins into different biological pathways. The top 10 significant biological pathways for the KEGG Pathways Database were neuroactive ligand-receptor interaction, systemic lupus erythematosus, cytokinecytokine receptor interaction, pathways in cancer, bile secretion, Wnt signaling pathway, cell adhesion molecules, metabolism of xenobiotics by cytochrome P450, drug metabolism, and tight junction ( Table 2). The uploaded gene count, (i.e., number of DEGs) and the total number of genes in the top 10 biological pathways are provided in Table 2.

Alternation in Gene Pair Correlations
Carcinogenesis is a complicated and complex process, which requires the coordination of multiple genes. To understand the modulation of the top 10 significant biological pathways and the change in expression in relation to each other in colon cancer, we calculated and compared the pairwise correlations for all genes in each biological pathway between healthy controls and colon cancer patients ( Figure 1). It is clear that the genes in most of the biological pathways tend to lose their correlations/coordination upon carcinogenesis, implying that the decoupling of gene regulations might be essential for colon cancer development. This observation is consistent with our previous studies [18][19][20].

Top 20 Upregulated DEGs
As shown above, the top 20 upregulated DEGs had their fold changes more than 360 folds. Further analysis showed that four upregulated DEGs belong to the Melanoma-Associated Antigen-A (MAGE-A) cluster. The MAGE-A gene cluster is located at Xq28 in the human genome and consists of 12 genes. The genes show silent/low expression in all normal tissues except in the male germ cell and the placenta, as the MAGE-A family is involved in spermatogenesis and embryonic development. Recently, the role of Melanoma-Associated Antigen-A (MAGE-A) gene expression in various human cancers was reviewed and it was reported that genes in this family are potential therapeutic targets in cancer immunotherapies [21]. Although the exact function of these genes is not clearly understood, some members (MAGE-A2, A3/6, and A9) have been shown to be tumorigenic and involved in the dysregulation of the tumor suppressor p53 [22][23][24][25][26].
Overexpression of IGF-1 has been linked to an increased risk of colon cancer, and IGF-1 has been shown to stimulate the growth of the colon cancer cell lines HT-29 and SW-480 [27,28]. Three kallikreins, KLK6, KLK7, and KLK8, were found to be overexpressed in colon cancer. The kallikrein gene family, which is located on chromosome 19q13.4, encodes serine proteases. Proteases including kallikreins have been associated with the progression of colon cancer and reported to help in cancer progression by extracellular matrices and the invasion of surrounding tissues by the transformed cells [29,30]. KLK6, KLK7, KLK8, and KLK10 were recently reported as potential diagnostic biomarkers for colon adenocarcinoma [31]. PAEP (Glycodelin) is an immunosuppressive glycoprotein and a high level of glycodelin is observed in the serum of colon cancer patients [32]. It was reported as a biomarker for colon cancer as the serum glycodelin level is increased in patients with metastatic CRC [33]. NOTUM encodes palmitoleoyl-protein carboxylesterase, which is involved in the proliferation and migration of colorectal cancer and proposed as a diagnostic and therapeutic candidate [34].
SPRR1A, SPRR2E, and SPRR2D are three members of the small proline-rich proteins (involved in the keratinization pathway). SPRRs are critical components of the cornified cell envelope (CE). The CE is involved in the formation of an envelope beneath the plasma membrane that serves as a barrier to extracellular and environmental factors. Dysregulation results in the compromise of the barrier [35]. PRR9, a paralog of SPRR2G, is also observed to be overexpressed. SPRR1A was recently proposed as a prognostic marker of colon cancer as its expression was higher in tumors compared to the adjacent noncancerous tissues [36]. High expression of SPRR1A was reported to be associated with poor survival in colon cancer patients [37]. SPRR1A and SPRR2D are also interacting partners of KLK6 [38]. PPBP (also known as CXCL7) is a chemokine involved in the immune response during vascular injury [38,39]. It is proposed as a potential biomarker for cancers [40][41][42]. Overexpression of PPBP affects the PI3K/AKT/mTOR signaling pathway and is associated with poor prognosis for colon cancer [43][44][45].
DKK4 is a member of the Dickkopf family and modulates the Wnt signaling pathway. It was reported to enhance the migration and invasive characteristics of colon cancer cells [46]. Overexpression of DKK4 in colorectal cancer was also previously reported [47]. Zinc finger of the cerebellum 5 (ZIC5) is a transcriptional repressor. It plays an important role in colorectal cancer via regulating cell cycle progression and the modulation of CDC25/CDK/cyclin signaling [48]. Cystatin SN (cystatin 1, CST1) inhibits cysteine proteases and promotes cell proliferation and metastasis. It was reported as a biomarker in colorectal cancers [49]. COL10A is a member of the collagen family and has been reported to be overexpressed in colorectal cancer. It has also been proposed as a potential biomarker [50]. PRSS56 encodes a serine protease, which has a significant correlation with clinical stage in bile duct cancer [51]. FEZF1 is a transcriptional repressor of the zinc finger double domain protein family. No relationship with FEZF1 in CRC could be identified in the literature (as of 14 September 2022).

Top 20 Downregulated DEGs
As for the downregulated SDEGs, all of them had fold changes more than 45 folds. Four lipoproteins, APOA4, AOC3, APOA1, and APOB, were downregulated. Lipoproteins are formed by the attachment of apolipoproteins (APOs) to lipids, and hence apolipopro-teins function as lipid carriers. APOs are primarily synthesized in the liver and the intestine. APOA4 and APOA1 are involved in the transportation of chylomicrons (low-density lipoproteins, LDLs) and high-density lipoproteins (HDLs) [52], while APOB transports LDLs and APOC3 transports HDLs. APOA4 possesses anti-inflammatory activity in allergies [53]. However, no report was found on its association with colon cancer or CRC. Low ApoA-I levels have been proposed to be associated with increased mortality risk in colorectal cancer patients. APOA1 was reported to be associated with anti-inflammatory, antiangiogenic [54], immunoregulatory [55,56], and antithrombotic [57,58] activities. Although no association was found between colon cancer and APOB and APOC3 expressions, a recent report evaluated the role of APOB in bile duct cancer and demonstrated that APOB influences the infiltration degree of immune cells [59]. Furthermore, APOC3 is involved in the breakdown of triglycerides, and hence its plasma level directly correlates with the plasma triglyceride level [60].
OTOP2 and OTOP3 encode for otopetrin 2 and otopetrin 3, respectively. OTOP2 was downregulated in CRC and its expression is negatively correlated with the malignancy grade and patient survival [61]. Although the exact mechanism of action is not known, it could be regulated by wild type p53 [62]. OTOP3, which is a paralog of OTOP2, functions as a proton-selective channel and is involved in the maintenance of intracellular pH. OTOP3 has been reported as an interesting candidate for future research in colon cancer [63].
SLC10A2 (ASBT/Apical sodium-dependent bile acid transporter) and SLC30A10 (Solute Carrier Family 30 Member 10) are also downregulated in colon cancer. ASBT is involved in the reclamation of bile acids at the terminal ileum enterocyte brush border membrane [64]. ASBT-deficient mice are reported to show a 2-fold increase in the number of colon adenocarcinomas [65]. SLC30A10 has been observed to be downregulated in colorectal cancer tissues and cell lines. Low expression of SLC30A10 promotes cell proliferation and migration of colorectal cancer cells [66]. SLC30A10, which is a Mn transporter, is also involved in Zn transport in endosomes. Mn is essential for the function of several enzymes, such as those involved in the metabolism of neurotransmitters. The loss-of-function mutations of SLC30A10 are associated with adult-onset Parkinsonism. Expression of SLC30A10 and cellular Mn efflux are regulated by the composition of bile acids [67].
MS4A10 encodes for a component of a multimeric receptor complex, which is involved in signal transduction. However, its exact function is unknown. AQP8 is an aquaporin involved in water transport across the biological membranes. It is downregulated in CRC. AQP8 inhibits PI3K/AKT signaling [68]. CA1 encodes for carbonic anhydrase 1 and is reported to be downregulated in CRC patients [69]. CA1 is involved in electroneutral sodium chloride reabsorption and short-chain fatty acid uptake [69]. INSL5 belongs to the insulin superfamily and is downregulated in colon cancer. It encodes insulin-like peptide 5. Recently, INSL5 was reported as a potential therapeutic target for CRC [70]. It also plays a role in gut contractility in murine models [71]. It is important to note that colon cancer patients usually show changes in bowel movements including constipation. TMIGD1, which is a putative tumor suppressor, can induce G2-M cell cycle checkpoint arrest in colon cancer cells and is correlated with poor overall survival [72]. GUCA2B encodes uroguanylin, which is an endogenous hormone functioning as a paracrine endogenous ligand to regulate proliferation, metabolism, and barrier function in the intestine by binding and activating guanylate cyclase C (encoded by GUCY2C) [73].
G6PC is a glucose-6-phosphatase catalytic subunit component gene. Glucose-6-phosphatase is involved in gluconeogenesis and glycogenolysis. Thus, it plays a key role in glucose homeostasis. Alterations in glucose metabolism have been noticed in many types of cancer, including colon cancer [74]. CPO encodes carboxypeptidase O, which is involved in the small intestine phase of protein digestion. It is a membrane-anchored brush-border enzyme that enables the C-t proteolysis of the great majority of amino acids present in dietary proteins [75]. CPO was observed to be downregulated in our current analysis and no report on its downregulation in either colon cancer or CRC was found in the literature. KRTAP13-2 encodes Keratin Associated Protein 13-2, which was also downregulated in rectal cancer [76]. Peptide YY (encoded by PYY) is a gut hormone. Decreased expression of peptide YY has been reported to be relevant to the development and progression of colon adenocarcinoma [77]. BEST4 is significantly expressed in the colon and belongs to the bestrophin gene family of anion channels. However, it is dramatically downregulated in CRC [78]. Claudin 8 (encoded by CLDN8) is part of the tight junction in cell membranes. In contrast to our observation, CLDN8 was previously reported to be overexpressed in CRC patients and promoted cell proliferation, migration, and invasion of colorectal cancer cells [79].
We herein provide a brief overview on the functions of the top 40 modulated DEGs in colon cancer patients. Several genes have been identified as targets to develop therapeutic agents against colon cancer. However, early detection remains crucial for cancer patient survival, despite recent progress in drug development. Although several biomarkers and targets have been proposed for colon cancer (some of them were discussed above), there is lack of a technique that can address the issue of heterogeneity and differential expression of biomarkers in colon cancer in a unified and effective manner among diverse populations. Here, we investigated the alterations in gene pair correlations of the top 10 regulated biological pathways in colon cancer and identified a set of genes that could be applied for the early diagnosis of colon cancer.

Alternation in Gene Pair Correlations of the DEGs
In order to identify potential genes that could be used for the early detection of colon cancer, we calculated the changes of gene pair correlation coefficients (CCs) between colon cancer patients and health controls and identified a set of gene pairs with |log 2 FC| ≥ 2.50 and |∆CC| (i.e., CC cancer − CC control ) ≥ 0.70.

Design an Assay for Potential Early Diagnosis of Colon Cancer
The human body is a complex system. Any physiological process involves the delicate regulation of a system that contains many genes. These genes coordinate one another to fulfill the normal biological and physiological functions. Thus, any pathophysiological process, such as carcinogenesis and cancer development, would involve the disruption of current regulations and/or the re-establishment of new regulations. Our previous studies have shown that genes tend to lose their coordination in different types of cancer. Thus, we undertook the current study in an attempt to develop an assay that could be used for the early diagnosis of colon cancer. From the above analysis, we identified 25 gene pairs, which involve 32 genes, that met the significance criteria of |log 2 FC| ≥ 2.50 and |∆CC| ≥ 0.70. Although only one gene, SLC10A2, is in the list of the top 20 downregulated genes, the other genes play important roles in regulating vast biological, physiological, and/or pathophysiological processes. For example, gene AGTR1 encodes the angiotensin II receptor type 1. High expression of AGTR1 will stimulate the expression of vascular endothelial growth factor (VEGF), the key growth factor controlling angiogenesis, and is associated with a poor prognosis for colorectal cancer [81][82][83]. Herein, we propose an assay array, which includes the top 20 upregulated genes, top 20 downregulated genes, and the 31 genes involved in the significantly altered gene pairs, for potential early diagnosis of colon cancer. This array takes both gene expression and gene coordination into consideration. It would be more advantageous over the traditional biomarkers that only involve gene or protein expression. Further studies are warranted to validate and optimize this array using both tissue and fecal samples from colon cancer patients.

Data Acquisition
The colon cancer dataset TCGA-COAD was downloaded from The Cancer Genome Atlas (TCGA) via the Genomic Data Commons (GDC) data portal. It contains the RNA-Seq and clinical information for 471 colon adenocarcinoma patients and 41 healthy controls. For every patient or control, we analyzed the expression of 60,483 RNA transcripts in terms of FPKM values.

Identification and Visualization of Differentially Expressed Genes
Differentially expressed genes (DEGs) were identified using the DEGseq package from R based on a previously published protocol developed in our laboratory [18][19][20]. The output was expressed in normalized Log 2 FC (FC, fold change). Information on the change in expression for the genes in colon cancer patients compared to healthy controls was extracted using p < 0.001 and |log 2 FC| ≥ 2.50 as the significance criteria.

Biological Pathway Analysis
A Pathway Overrepresentation Analysis was performed using InnateDB [17], a publicly available resource that predicts biological pathways based on experiment fold change data sets, based on levels of differential gene expression. Pathways were assigned a probability value (P) based on the number of proteins present in a particular pathway and the degree to which they were differentially expressed or modified, relative to a control condition. The top 10 biological pathways that are significantly modulated in colon cancer were identified to be neuroactive ligand-receptor interaction, systemic lupus erythematosus, cytokine-cytokine receptor interaction, pathways in cancer, bile secretion, Wnt signaling pathway, cell adhesion molecules, metabolism of xenobiotics by cytochrome P450, drug metabolism, and tight junction.

Correlation Matrices
The correlation matrix for each biological pathway was computed using the NumPy Python package. These matrices were then visualized using the pandas and matplotlib Python packages. Positive and negative correlations were represented in blue and red, respectively. Correlation plots for the top 10 biological pathways were then plotted for cancer patients and normal controls (Figure 1). Gene pairs showing a change in correlation of |∆CC| ≥ 0.70 (significance cutoff level: 0.7) were identified and analyzed.

Conclusions
In this study, we integrated differential gene expression and gene expression correlation to identify a panel of 71 genes that may work together in concert and could potentially be used for the early diagnosis of CRC. Further studies including in vitro and in vivo validations are needed to confirm the roles they play both individually and coherently in the development of CRC. The major limitations of the current study are the small sample size and the lack of sufficient pathological information for the patients, which may decrease the sensitivity and specificity of our research protocol. We are currently endeavoring to establish collaborations with oncologists in both Canada and foreign countries in order to improve and optimize this potential CRC diagnostic panel of genes. Furthermore, we will conduct more detailed analysis of CRC in the future by including patients' clinical information, such as tumor stage, tumor grade, and metastatic status.