Identification of Key Candidate Genes and Pathways in Colorectal Cancer by Integrated Bioinformatical Analysis

Colorectal cancer (CRC) is one of the most common malignant diseases worldwide, but the involved signaling pathways and driven-genes are largely unclear. This study integrated four cohorts profile datasets to elucidate the potential key candidate genes and pathways in CRC. Expression profiles GSE28000, GSE21815, GSE44076 and GSE75970, including 319 CRC and 103 normal mucosa, were integrated and deeply analyzed. Differentially expressed genes (DEGs) were sorted and candidate genes and pathways enrichment were analyzed. DEGs-associated protein–protein interaction network (PPI) was performed. Firstly, 292 shared DEGs (165 up-regulated and 127 down-regulated) were identified from the four GSE datasets. Secondly, the DEGs were clustered based on functions and signaling pathways with significant enrichment analysis. Thirdly, 180 nodes/DEGs were identified from DEGs PPI network complex. Lastly, the most significant 2 modules were filtered from PPI, 31 central node genes were identified and most of the corresponding genes are involved in cell cycle process, chemokines and G protein-coupled receptor signaling pathways. Taken above, using integrated bioinformatical analysis, we have identified DEGs candidate genes and pathways in CRC, which could improve our understanding of the cause and underlying molecular events, and these candidate genes and pathways could be therapeutic targets for CRC.


Introduction
Colorectal cancer (CRC) is one of the most common malignancies in the world, there were estimated more than 777,000 of new cases in 2015 in the developed countries [1,2], and in China alone, there were about 376,000 of new CRC cases and 191,000 of death CRC cases in 2015, accounting for the fifth of malignant tumor incidence and mortality [3]. Although there are extensive studies on the mechanism in colorectal cancer formation and progression, the causes of colorectal cancer is still not clear. The occurrence and progression of colon cancer are correlated with multiple factors from the point of view of science and research, for instance, gene aberrations, cellular and environmental factors [4]. Due to high morbidity and mortality in colorectal cancer, revealing the causes and the underlying molecular mechanisms, discovering molecular biomarkers for early diagnosis, prevention and personalized therapy, is critically important and highly demanded.
Gene chip or gene profile is a gene detection technique that has been used for more than ten years, using gene chips can quickly detect all the genes within the same sample time-point expression information, which is particularly suitable for differentially expressed genes screening [5]. With the Int. J. Mol. Sci. 2017, 18, 722 2 of 15 wide application of gene chips, a large number of cores slice data have been produced, and most of the data have been deposited and stored in public databases. Integrating and re-analyzing these data can provide valuable clues for new research. Many microarray data profiling studies have been carried out on CRC in recently years [6], and hundreds of differentially expressed genes (DEGs) have been obtained. However, the results are always limited or inconsistent due to tissue or sample heterogeneity in independent studies, or the results are generated from a single cohort study. Thus, no reliable biomarkers have been identified in CRC. However, the integrated bioinformatics methods combining with expression profiling techniques will be innovative and might solve the disadvantages.

Identification of DEGs in Colorectal Cancers
NCBI-GEO is a free database of microarray/gene profile and next-generation sequencing, from which colorectal cancer and normal or adjacent mucosa tissue gene expression profile of GSE28000, GSE21815, GSE44076 and GSE75970 were obtained. The microarray data of GSE28000 had 43 American Africans (AA) and 43 European Africans (EA) CRC tissues and 40 normal colon tissues [7]. The GSE21815 data had 131 CRC tissue and 9 normal colon tissue [8,9]. The GSE44076 data included 98 CRC tissues and 50 healthy colon mucosa [10,11], and the GSE75970 data had 4 pairs of colorectal cancer tissues and matched paraneoplastic tissues. Using p < 0.05 and [logFC] > 1 as cut-off criterion, we extracted 1572, 8830, 2612 and 3850 differentially expressed genes (DEGs) from the expression profile datasets GSE28000, GSE21815, GSE44076 and GSE75970, respectively. After integrated bioinformatical analysis, total of 292 consistently expressed genes were identified from the four profile datasets (Figure 1), including 165 up-regulated genes and 127 down-regulated genes in the colorectal cancer tissues, compared to normal colon tissues (Table 1). Employing Morpheus software, we developed a heat map of the 165 up-regulated and 127 down-regulated DEGs using data profile GSE44076 as a reference, showing the significantly differential distribution of the 292 DEGs ( Figure S1).

Figure 1.
Identification of 292 commonly changes DEGs from the four cohort profile data sets (GSE28000, GSE21815, GSE44076 and GSE75970) using Morpheus Website (Available online: https://software.broadinstitute.org).Different color areas represented different datasets. The cross areas meant the commonly changed DEGs. DEGs were identified with classical t-test, statistically significant DEGs were defined with p<0.05 and [logFC]>1 as the cut-off criterion. Table 1. 292 differentially expressed genes (DEGs) were identified from four profile datasets, including 165 up-regulated genes and 127 down-regulated genes in the colorectal cancer tissues, compared to normal colon tissues. (The up-regulated genes were listed from the largest to the smallest of fold changes, and down-regulated genes were listed from the smallest to largest of fold changes). Identification of 292 commonly changes DEGs from the four cohort profile data sets (GSE28000, GSE21815, GSE44076 and GSE75970) using Morpheus Website (Available online: https://software.broadinstitute.org).Different color areas represented different datasets. The cross areas meant the commonly changed DEGs. DEGs were identified with classical t-test, statistically significant DEGs were defined with p < 0.05 and [logFC] > 1 as the cut-off criterion. Table 1. 292 differentially expressed genes (DEGs) were identified from four profile datasets, including 165 up-regulated genes and 127 down-regulated genes in the colorectal cancer tissues, compared to normal colon tissues. (The up-regulated genes were listed from the largest to the smallest of fold changes, and down-regulated genes were listed from the smallest to largest of fold changes).

DEGs Gene Ontology Analysis in Colorectal Cancers
Candidate DEGs functions and pathways enrichment were analyzed using multiple online databases, including DAVID, KEGG PATHWAY (Available online: http://www.genome.jp/kegg), Reactome (Available online: http://www.reactome.org), BioCyc (Available online: http://biocyc.org), Panther (Available online: http://www.pantherdb.org) [12], NHGRI and Gene Ontology website with p < 0.05 as the cut-off criterion [13][14][15]. DEGs gene ontology analysis (GO) were performed with DAVID, Panther. The DEGs were classified into three functional groups: molecular function group, biological process group, and cellular component group ( Figure 2A). As shown in Figure 2B and Table 2, in the biological process group, up-regulated genes mainly enriched in single-organism process, single-organism cellular process, cell proliferation process and mitotic cell cycle process, and the down-regulated genes mainly enriched in localization, single-organism process, single-organism cellular process, mitotic cell cycle process. In the molecular function group, up-regulated genes mainly enriched in binding, receptor binding, protein binding, and the down-regulated genes mainly enriched in binding, protein binding. In the cellular component group, up-regulated genes mainly enriched in extracellular space, extracellular region, and the down-regulated genes mainly enriched in cell part, membrane-bounded organelle, organelle. These results showed that most of the DEGs were significantly enriched in single-organism, binding, cell part and mitotic cell cycle.

DEGs Gene Ontology Analysis in Colorectal Cancers
Candidate DEGs functions and pathways enrichment were analyzed using multiple online databases, including DAVID, KEGG PATHWAY (Available online: http://www.genome.jp/kegg), Reactome (Available online: http://www.reactome.org), BioCyc (Available online: http://biocyc.org), Panther (Available online: http://www.pantherdb.org) [12], NHGRI and Gene Ontology website with p<0.05 as the cut-off criterion [13][14][15]. DEGs gene ontology analysis (GO) were performed with DAVID, Panther. The DEGs were classified into three functional groups: molecular function group, biological process group, and cellular component group ( Figure 2A). As shown in Figure 2B and Table 2, in the biological process group, up-regulated genes mainly enriched in single-organism process, single-organism cellular process, cell proliferation process and mitotic cell cycle process, and the down-regulated genes mainly enriched in localization, single-organism process, single-organism cellular process, mitotic cell cycle process. In the molecular function group, up-regulated genes mainly enriched in binding, receptor binding, protein binding, and the down-regulated genes mainly enriched in binding, protein binding. In the cellular component group, up-regulated genes mainly enriched in extracellular space, extracellular region, and the down-regulated genes mainly enriched in cell part, membrane-bounded organelle, organelle. These results showed that most of the DEGs were significantly enriched in single-organism, binding, cell part and mitotic cell cycle. (A)

Signaling Pathway Enrichment Analysis
DEGs functional and signaling pathway enrichment were conducted using online websites of KEGG PATHWAY, Reactomen, BioCyc, Panther, NHGRI and Gene Ontology. The up-regulated genes mainly enriched in G α (i) signaling events, GPCR ligand binding, Chemokine receptors bind chemokines, Class A/1 (Rhodopsin-like receptors), and Cell Cycle signaling pathway. The down-regulated genes mainly enriched in Cell Cycle, Mitotic Prometaphase, Resolution of Sister Chromatid Cohesion, Aldosterone-regulated sodium reabsorption, and Chemokine receptors bind chemokines signaling pathway ( Figure 3, Table 3). Signaling pathway analysis showed that DEGs had common pathways in Cell Cycle, Chemokine receptors bind chemokines.

Signaling Pathway Enrichment Analysis
DEGs functional and signaling pathway enrichment were conducted using online websites of KEGG PATHWAY, Reactomen, BioCyc, Panther, NHGRI and Gene Ontology. The up-regulated genes mainly enriched in G α (i) signaling events, GPCR ligand binding, Chemokine receptors bind chemokines, Class A/1 (Rhodopsin-like receptors), and Cell Cycle signaling pathway. The down-regulated genes mainly enriched in Cell Cycle, Mitotic Prometaphase, Resolution of Sister Chromatid Cohesion, Aldosterone-regulated sodium reabsorption, and Chemokine receptors bind chemokines signaling pathway ( Figure 3, Table 3). Signaling pathway analysis showed that DEGs had common pathways in Cell Cycle, Chemokine receptors bind chemokines.     Using the STRING online database (Available online: http://string-db.org) [16] and Cytoscape software [17], total of 180 DEGs (113 up-regulated and 67 down-regulated genes) of the 292 commonly altered DEGs were filtered into the DEGs PPI network complex, containing 180 nodes and 518 edges ( Figure 4A), and 112 of the 292DEGs did not fall into the DEGs PPI network. Among the 180 nodes, 31 central node genes (bold in Table 1) were identified with the filtering of degree >10 criteria (i.e., each node had more than 10 connections/interactions) , and the most significant 10 node degree genes were CDK1, CCNB1, CENPE, KIF20A, CXCL12, DLGAP5, CCNA2, ITGA2, MAD2L1 and NMU. According the degree of importance, we chose 2 significant modules from the PPI network complex for further analysis using Cytotype MCODE. Pathway enrichment analysis showed that Module 1 consisted of 17 nodes and 109 edges ( Figure 4B, Table 4), which are mainly associated with cell cycle process, sister chromatid and segregation, and that Module 2 consisted of 14 nodes and 91 edges ( Figure 4C, Table 5), which are mainly associated with chemokine signaling pathway, Gα (i) signaling events, and GPCR ligand binding pathway. edges ( Figure 4C, Table 5), which are mainly associated with chemokine signaling pathway, Gα (i) signaling events, and GPCR ligand binding pathway.

Validation of the DEGs in TCGA Dataset
To confirm the reliability of the identified DEGs from the 4 datasets, we downloaded the TCGA CRC dataset (including 22 normal and 122 colon cancers) and analyzed the data using the same strategy as used in this report. We found that all the 165 up-regulated genes identified in this study were also significantly overexpressed in the TCGA colon cancers (See Figure S2A and Table S1), and 114 down-regulated genes identified in this study were also found significantly under-expressed in the TCGA colon cancers (See Figure S2B and Table S1), only 13 of the downregulated genes were not in the list. The total consistency of the up-and down-regulated genes was 95.5%, suggesting our results of the identified candidate genes are reliable.

Discussion
Numerous basic and clinical studies have been conducted to reveal the causes and underlying mechanisms of colorectal cancer formation and progression in the past several decades, but the incidence and mortality of CRC is still very high in the world, because most studies focus on a single genetic event or the results are generated from a single cohort study [18]. This study integrated four cohorts profile datasets from different groups, utilized bioinformatics methods to deeply analyze these datasets, and identified 292 commonly changed DEGs(165 up-regulated and 127 down-regulated) in the first step. In the second step, the 292 EDGs were classified into three groups (molecular functions, biological process and cellular component groups) by GO terms using multiple approaches.In the third step, the up-and down-regulated DEGs were further clustered based on functions and signaling pathways with significant enrichment analysis.In the fourth step, DEGs protein-protein interaction (PPI) network complex was developed and 180 nodes/DEGs were identified with 518 edges, and finally, the most significant 2 modules were filtered from the PPI network complex, among them, 31 central node genes were identified and most of the corresponding genes were associated with cell cycle process, Chemokines and G protein-coupled receptor signaling pathways.
It has been known that Wnt/β-catenin activation and malignant transformation of inflammatory bowel disease are the two major causes of colorectal cancer. Both Wnt/β-catenin [19,20] and inflammatory signaling pathway activation [21] lead to intestinal epithelial disruption ofhomeostasis, for instance, proliferation is increased, differentiation and apoptosis are decreased, in the intestinal tract.Through integrated bioinformatical analysis, we have identified 31 central node/genes, among them, the first module or cluster ( Figure 4B) consisting of 17 genes, including CDK1, CCNB1, CENPE, KIF20A, CCNA2, MAD2L1, were listed at the top of the most changed genes, and their biological functions are involved in cell cycle control and mitotic regulation [22]. CCNB1 and CCNA2 are cyclins B and A, and CDK1 is cyclin-dependent kinase 1, all three of them are indirect downstream targets of β-catenin and exhibit opposite functions with p21CIP1 and p27KIP1 to promote cycle and mitosis [23,24]. In contrast, p21CIP1 and p27KIP1 are cyclin-dependent kinase inhibitors and delay cell cycle and mitosis, and both p21CIP1 and p27KIP1 bind to cyclin D1, a direct downstream target of β-catenin [23,24]. Besides the functions of promoting cell cycle, mitosis and cell proliferation to initiate colorectal cancer development, Wnt/β-catenin signaling pathway also involves in epithelial-mesenchymaltransition (EMT), the later is an early event of invasion and metastasis, through up-regulating EMT-associated transcription factors Snail, Twist and ZEB1 [4,25]. In addition, Wnt/β-catenin signaling keeps the stemness of cancer stem cells via up-regulating stem cell-related transcription factors Oct4, Sox2, c-Myc, Nanog and KLF4 [4,25]. Cancer stem cells are the main causes of cancer cell proliferation, metastasis and therapeutic resistance.
Besides the chemokine signaling in module/cluster 2 ( Figure 4C), Gα (i) signaling and G-protein coupled receptor (GPCR) signaling pathway is also in this module. Previous studies have demonstrated that GPCR is associated with metabolism, and the abnormal metabolism leads to chronic inflammation in the gut and oncogenic signaling activation [21,31]. In addition, GPCR participates in Rho-GTPase pathway and modulates cell-cell contact proteins, affects cell morphology and mobility [32,33]. Recent studies have also shown that aberrant expression of GPCR up-regulate the expression of intestine-specific stem cell marker LGR5 [34], causing colorectal carcinogenesis in vivo and chemo-resistance in vitro.
Consistent with our studies, the recent studies have also reported the identification of DEGs in colorectal cancers [35,36]. For examples, Yan's groups analyzed gene expression profile GSE32323 that contained 34 samples, including 17 specimens of CRC tissues and 17 of paired normal tissues from CRC patient. From which, 1347 DEGs were identified, including 659 upregulated genes and 688 downregulated genes, the majority of the identified DEGs were mainly enriched in cell cycle-related biological processes and pathways, such as CDK1, CCNB1, MAD2L1, BUB1B, etc. These DEGs were as also identified in our study. We wanted to point out that Yan's study was based on a single dataset generated from Japanese patients, and the main reason for Japanese colorectal carcinogenesis is the alteration of cell cycle-related biological processes and pathways [37]. Different from Yan's report, we used 4 datasets generated from the African American, European American, Spanish, Japanese and Chinese, represented 5 different races of world-wide populations, and we found that besides the cell cycle-related biological processes and pathways, inflammatory pathway (CXCL-CXCR), another major reason for colorectal carcinogenesis, was identified, because one of the major causes of CRC in European Americans and Spanish is malignant transformation of chronic colitis [21].
Another report from Kou et al. was based on the single dataset GSE4107 generated from 12 CRC patients and ten healthy controls [36,38]. A total of 612 up-and 639 downregulated genes were identified. The upregulated DEGs were mainly involved in the regulation of cell growth, migration, and the MAPK signaling pathway. The downregulated DEGs were significantly associated with oxidative phosphorylation, Alzheimer's disease, and Parkinson's disease. It should be noticed that these CRC patients were from early onset CRC withnon-FAP (familial adenomatous polyposisnon) or non-hereditary non-polyposis and all patients were young (age < or = 50 years old), ethnicity-(Chinese), and tissue-matched [38]. One of the pathways, that is G protein-coupled receptor pathway, was also identified in our study, even though our conclusion was generated from 4gene expression profile datasets from middle-or late-stage with very few early stages of CRC.
Colorectal cancer is a group of histologically and molecularly heterogeneous diseases characterized by differing sets of epigenetic and genetic alterations involved in multiple functional signaling pathways, and the later is modulated by genetic and epigenetic events, leading to the alterations of gene expression at transcriptional and/or translational levels. Therefore, the characteristics of CRC cannot be explained only by gene expression profiles, but gene expression profile could represents one type of alterations in cancer or characteristics of cancers. Many factors should be considered, including gene mutations, hypo-or hyper-methylation, microRNAs and long non-coding RNAs, MSI, etc., which could cause or participate in colorectal carcinogenesis and are associated with survival in part or together. For instance, the association of LINE-1 hypomethylation with inferior survival is stronger in MSI-high CRCs than in MSS CRCs [39]. Moreover, the molecular classification of cancer types or subtypes have been identified according to the molecular changes (e.g., gene expression, epigenetic changes, etc.). In the case of colorectal cancer, the CRC could be classified into adenocarcinoma, mucinous adenocarinoma, signet-ring cell carcinoma, according to cell components of the CRC tissues, and the components of the CRC could have important prognostic significance. As reported by Inamura et al., even a minor (50% or less) signet-ring cell component in CRC was associated with higher patient mortality, independent of various tumor molecular and other clinicopathological features. In contrast, mucinous component was not associated with mortality in CRC patients [40]. However, the molecular changes in each subtype of CRC have not been completely understood. For example, TCGA CRC data has 20 mucinous carcinomas, but there is no significant difference of gene expression compared to the adenocarcinomas.
Taken above, using multiple cohorts profile datasets and integrated bioinformatical analysis, we have identified 292 DEGs candidate genes at screening step, and filtered 180 gene nodes in DEGs protein-protein interaction network complex, and finally found 31 mostly changed hub genes, which significant enriched in several pathways, mainly associated with cell cycle process, chemokines and G-protein coupled receptor signaling pathways in colorectal cancer. These findings could significantly improve our understanding of the cause and underlying molecular events in CRC, these candidate genes and pathways could be therapeutic targets for CRC.

Microarray Data Information and DEGs Identification
NCBI-GEO is a free database of microarray/gene profile and next-generation sequencing, from which colorectal cancer and normal or adjacent mucosa tissue gene expression profile of GSE28000, GSE21815, GSE44076 and GSE75970 were obtained. The microarray data of GSE28000 was based on GPL4133 Platforms (Agilent-014850 Whole Human Genome Microarray 4 × 44K G4112F, Agilent Technologies, Santa Clara, CA, USA) and included 43 American Africans (AA) and 43 European Africans (EA) CRC tissues and 40 normal colon tissues (Submission date: 16 March 2011) [7]. The GSE21815 data was based on GPL6480 Platforms (Agilent-014850 Whole Human Genome Microarray 4 × 44K G4112F, Probe Name version, Agilent Technologies) and included 131 CRC tissue and 9 normal colon tissue (Submission date: 13 May 2010) [8,9]. The GSE44076 data was based on GPL13667 Platforms (Affymetrix Human Genome U219 Array, Affymetrix, Santa Clara, CA, USA) and included 98 CRC tissues and 50 healthy colon mucosa (Submission date: 5 Feburary 2013) [10,11]. The GSE75970 data was based on GPL14550 Platforms (Agilent-028004 SurePrint G3 Human GE 8 × 60K Microarray, Agilent Technologies) and included 4 pairs of colorectal cancer tissues and matched paraneoplastic tissues (Submission date: 14 December 2015). We chose these 4 datasets for integrated analysis in this study because these 4 datasets represented five different racial populations: the GSE28000 was generated from the Americans (American Africans and European Africans), the GSE21815 was conducted in Japanese, the GSE44076 data was from Spanish, and the GSE75970 data was conducted in Chinese. This project was approved by Institutional Review Board of Jnining Medical University, and the Code was "JNMC-IRB-20170109". All procedures of this study were complied with the protocol.
The raw data of high throughput functional genomic expression were integrated for the analysis, the TXT format data were processed in Morpheus Website (Available online: https://software. broadinstitute.org). DEGs were identified with classical t test, statistically significant DEGs were defined with p < 0.05 and [logFC] > 1 as the cut-off criterion.
For validation, The Cancer Genome Atlas (TCGA) (Available online: https://cancergenome. nih.gov/) colon cancer data was downloaded and analyzed. The TCGA colon cancer data included 22 normal and 122 colon cancers.

Integration of Protein-Protein Interaction (PPI) Network, Modular Analysis and Significant Candidate Genes and Pathway Identification
First, online database STRING (Available online: http://string-db.org) [16] was employed to develop DEGs-encoded proteins and protein-protein interaction network (PPI). Second, Cytoscape software [17] was utilized to construct protein interaction relationship network and analyze the interaction relationship of the candidate DEGs encoding proteins in colon cancer. Third, the Network Analyzer plug-in was used to calculate node degree, that was, the numbers of inter-connections to filter hub genes of PPI. The corresponding proteins in the central nodes might be the core proteins and key candidate genes that have important physiological regulatory functions.

Conclusions
Using multiple cohorts profile datasets and integrated bioinformatical analysis, we have identified commonly changed 292 DEGs candidate genes, and finally found 31 mostly changed hub genes, which significant enriched in several pathways, mainly associated with cell cycle process, chemokines and G-protein coupled receptor signaling pathways in colorectal cancer. These findings significantly improve the understanding of the cause and underlying molecular events in colorectal cancer, and the candidate genes and pathways could be used as therapeutic targets.