Exploring Core Genes by Comparative Transcriptomics Analysis for Early Diagnosis, Prognosis, and Therapies of Colorectal Cancer

Simple Summary Colorectal cancer (CRC) is a complex disease that has a high mortality rate. This study explored CRC-related core genes (CGs) from multiple microarray gene-expression profiles in the NCBI-GEO database by combining some statistics and bioinformatics techniques. It also disclosed their molecular functions, biological processes, cellular components, signaling pathways, and transcriptional and post-transcriptional regulatory factors by using different online bioinformatics tools and databases. The prognostic power of CGs was investigated from the independent TCGA database by using survival probability curves and box plots of CGs-expressions in different stages (control, Stage 1, Stage 2, Stage 3, and Stage 4) of CRC. Finally, a few CGs-guided drug molecules were suggested for the treatment of CRC by molecular docking and dynamic simulation studies. Therefore, the findings of this study would be useful resources for early diagnosis, prognosis, and therapies of CRC. Abstract Colorectal cancer (CRC) is one of the most common cancers with a high mortality rate. Early diagnosis and therapies for CRC may reduce the mortality rate. However, so far, no researchers have yet investigated core genes (CGs) rigorously for early diagnosis, prognosis, and therapies of CRC. Therefore, an attempt was made in this study to explore CRC-related CGs for early diagnosis, prognosis, and therapies. At first, we identified 252 common differentially expressed genes (cDEGs) between CRC and control samples based on three gene-expression datasets. Then, we identified ten cDEGs (AURKA, TOP2A, CDK1, PTTG1, CDKN3, CDC20, MAD2L1, CKS2, MELK, and TPX2) as the CGs, highlighting their mechanisms in CRC progression. The enrichment analysis of CGs with GO terms and KEGG pathways revealed some crucial biological processes, molecular functions, and signaling pathways that are associated with CRC progression. The survival probability curves and box-plot analyses with the expressions of CGs in different stages of CRC indicated their strong prognostic performance from the earlier stage of the disease. Then, we detected CGs-guided seven candidate drugs (Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D) by molecular docking. Finally, the binding stability of four top-ranked complexes (TPX2 vs. Manzamine A, CDC20 vs. Cardidigin, MELK vs. Staurosporine, and CDK1 vs. Riccardin D) was investigated by using 100 ns molecular dynamics simulation studies, and their stable performance was observed. Therefore, the output of this study may play a vital role in developing a proper treatment plan at the earlier stages of CRC.

. The pipeline of this study.

Data Sources and Descriptions
The necessary datasets that were analyzed in this study are described below.

Collection of Microarray Datasets to Explore CRC-Causing Core Genes
We downloaded three microarray datasets of CRC using the accession IDs GSE106582, GSE110223, and GSE74602 from the NCBI Gene Expression Omnibus (GEO) database. The GPL10558 platform was used for the GSE106582 dataset, which contains 194 samples, including 77 cancer and 117 adjacent tissue samples. The GPL96 platform was used for the GSE110223 dataset, which contains 26 samples, including 13 cancer and 13 adjacent tissue samples. The GPL6104 platform was used for the GSE74602 dataset, which contains 60 samples, including 30 cancer and 30 adjacent tissue samples.

Collection of Drug Molecules Set for Drug Repositioning
For identifying candidate repurposed drugs, we collected target receptor-guided drug molecules from the DSigDB [23] database (Table S1).

Method for Identification of DEGs
We used the GEO2R web tool [24] based on the LIMMA (linear models for microarray data) approach to identify DEGs between cancer and adjacent tissue samples for each of the three datasets. The LIMMA method uses modified t-statistics to calculate p-values. We used the Benjamini-Hochberg (BH) approach [25] to adjust the p-values. The log2 foldchange (Log2FC) and adjusted p-values were used to separate the up-and down-regulated DEGs by the following cut-offs:

Data Sources and Descriptions
The necessary datasets that were analyzed in this study are described below.

Collection of Microarray Datasets to Explore CRC-Causing Core Genes
We downloaded three microarray datasets of CRC using the accession IDs GSE106582, GSE110223, and GSE74602 from the NCBI Gene Expression Omnibus (GEO) database. The GPL10558 platform was used for the GSE106582 dataset, which contains 194 samples, including 77 cancer and 117 adjacent tissue samples. The GPL96 platform was used for the GSE110223 dataset, which contains 26 samples, including 13 cancer and 13 adjacent tissue samples. The GPL6104 platform was used for the GSE74602 dataset, which contains 60 samples, including 30 cancer and 30 adjacent tissue samples.

Collection of Drug Molecules Set for Drug Repositioning
For identifying candidate repurposed drugs, we collected target receptor-guided drug molecules from the DSigDB [23] database (Table S1).

Method for Identification of DEGs
We used the GEO2R web tool [24] based on the LIMMA (linear models for microarray data) approach to identify DEGs between cancer and adjacent tissue samples for each of the three datasets. The LIMMA method uses modified t-statistics to calculate p-values. We used the Benjamini-Hochberg (BH) approach [25] to adjust the p-values. The log 2 fold-change (Log 2 FC) and adjusted p-values were used to separate the up-and down-regulated DEGs by the following cut-offs: We considered common DEGs (cDEGs) from three datasets to identify core CGs. All DEGs were visualized using a Venn diagram using FunRich 3.1.3 [26].

Protein-Protein Interaction (PPI) Network Analysis
The protein-protein interaction (PPI) network was utilized to detect core genes (CGs) from cDEGs. We considered the STRING database [27] with a median confidence score (MCS) of 0.4 to produce a PPI network of cDEGs and Cytoscape software for better visualization of the network [28]. The CGs were selected from the PPI network using the CytoHubba plugin in Cytoscape [28,29]. The present study used maximal clique centrality (MCC) topology analysis methods to identify the CGs.

Association of CGs with Different Stages of CRC Progression
To investigate the association of CGs with the different stages of CRC based on independent databases, we performed box-plot analysis based on their expression levels in different CRC progression stages (Normal status, Stage 1, Stage 2, Stage 3, and Stage 4) through the UALCAN web tool with the TCGA-COAD and TCGA-READ databases [30,31].

Prognosis Power of CGs
To investigate the prognosis power of CGs by multivariate Kaplan-Meier survival probability curves, we considered the SurvExpress web tool based on the TCGA-COAD and TCGA-READ databases (https://portal.gdc.cancer.gov/exploration, accessed date: 2 January 2022) [30,32]. The log-rank test was used in SurvExpress, and the risk group hazard ratio with a 95% confidence interval was included in the Kaplan-Meier survival plot [32]. The p-value < 0.05 was used as the cut-off.

CGs-Set Enrichment Analysis
CGs-set enrichment analysis (cGSEA) determines the classes of genes or proteins that are over-represented (enriched) in a predefined large set of genes or proteins that are associated with the terms of interest, including gene ontology (GO), pathways, diseases, chemicals, drugs, biomolecules (miRNA, TFs), and so on. To detect the significantly enriched terms of interest, let A i be the predefined gene-set in the ith term of interest, and M i denotes the number of CGs in A i (i = 1, 2, . . . , r); T is considered as an enriched gene number that created a combined set A such that where A c i is considered as the complement-set of A i . Again, suppose t represents the number of CGs and m i denotes the number of CGs subset of A i . Table 1 summarizes these results.
The Enrichr web tool [33] was considered to investigate the association of CGs with terms of interest. This web tool uses the Fisher exact test to examine the significance of the association between CGs and ith term of interest.

Association of CGs with Different Diseases
We considered the Enrichr web tool [33] to verify the association of CGs with different diseases, including CRC, using the DisGeNET database, which was constructed based on 21,671 genes and 30,170 diseases [34]. It measures the association of a disease with a group of CGs that are overlapped (common) with the reference gene set of that disease (see Table 1). To investigate the pan-cancer role of CGs, we performed the pan-cancer analysis of each CG by the TIMER 2.0 web tool [35] with the TCGA database [36]. In both cases, the p-value < 0.05 was selected as the cut-off for statistical significance.

Association of CGs with GO Terms and KEGG Pathway
To disclose the pathogenetic processes of CGs, we performed CGs-set enrichment analysis with GO terms and pathways by using the Enrichr web tool [33]. Biological process (BPs), molecular function (MFs), and cellular component (CCs) were investigated to explore potential GO terms and pathways based on the KEGG database as displayed in Table 1. A p-value < 0.001 was used as the cut-off for the statistical significance.

CGs Regulatory Network Analysis
A gene regulatory network (GRN) provides information about molecular regulators that connect to regulate the gene expression level of mRNA. Transcription factors (TFs) and microRNAs (miRNAs) are considered the major regulators of gene expression. TFs proteins are regarded as the significant contributors to GRN because they bind to a particular region of DNA (enricher/promoter) and influence gene expression at the transcriptional level. A miRNA is a non-coding RNA considered a central post-transcriptional regulator of gene expression. The human genome contains up to 1600 TFs and 1900 miRNAs. A TFs vs. CGs network is considered an undirected graph, where nodes represent TFs or CGs and edges depict interactions between TFs and CGs, respectively. A TF-node is considered the major regulatory factor for CGs if it contains the largest number of interactions with CG nodes. We considered regulatory analysis of CGs (transcription factors (TFs) vs. CGs and micro-RNAs (miRNAs) vs. CGs) through Network Analyst [37] platform-based JASPAR [38] and TarBase [39] databases, respectively, to detect the core transcriptional and post-transcriptional regulators of CGs. For better illustration, we used Cytoscape software [28]. The core regulators were chosen by utilizing degree [40] and betweenness [41] scores.

Molecular Docking
We conducted molecular docking studies of receptors and drug molecules to explore FDA-approved repurposable drugs for CRC. CGs-mediated proteins and related TFs proteins were considered drug target receptors (p = 14). The online database DSigDB was used to extract CGs-guided drug agents. The 3-dimensional (3D) structures of AURKA, TOP2A, CDK1, PTTG1, CDC20, MAD2L1, CKS2, MELK, TPX2, YY1, and SRF targets were retrieved from the Protein Data Bank (PDB) [42] with IDs 6VPM, 5NNE, 5LQF, 7NJ1, 4GGC, 2V64, 5LQF, 5M5A, 6VPM, 4C5I, and 1HBX, respectively. The remaining targets, FOXC1, CDKN3, and NFIC, were retrieved from the SWISS-MODEL [43] with the Uniport IDs Q12948, P08651, and Q16667, respectively. Using the PubChem database [44], we retrieve the 3D structures of all (q = 158) drug molecules. The visualization of the receptor proteins and co-crystal ligands were performed via the Discovery Studio Visualizer 2019 [27]. The receptor proteins were processed using AutoDock tools [45] and the Swiss PDB viewer by adding the structural charges and reducing the energy of receptors, respectively [45,46]. The docked complexes were analyzed through Discovery Studio Visualizer 2019. Let B ij be the binding affinity (BA) score of ith receptors (i = 1, 2, . . . , p) and jth drugs (j = 1, 2,..., q). The receptors and drug molecules were sorted by the decreasing order of their average BA score for selecting the top-ordered few potential candidate repurposable drugs.

Molecular Dynamics (MD) Simulation
We performed MD simulations of the top-orderedprotein-ligand complexes (TPX2-Manzamine A, CDC20-Cardidigin, MELK-Staurosporine, and CDK1-Riccardin D) through the YASARA software (Version: 22.8.22) [47] based on the AMBER14 force field [48]. Prior to simulation, the hydrogen-bonding network of each complex in a simulated cell was optimized using a TIP3P water model [45]. The periodic limit conditions were kept constant at 0.997 gL-1 of solvent concentration. The primary energy was minimized in each simulation by considering the steepest gradient technique with 5000 cycles. The complexes "TPX2-Manzamine A", "CDC20-Cardidigin", "MELK-Staurosporine", and "CDK1-Riccardin D" consist of a total of 56,287, 35,859, 81,347, and 45,153 atoms, respectively. At the Berendsen thermostat [49] and constant pressure, a 100 ns MD simulation was examined. Please see our previous publications for details about the MD simulation strategy [50][51][52][53]. For subsequent analysis, we took snapshots of the trajectories every 250 ps, ran them via the built-in script of YASARA [54] macro, and calculated the binding free energy of the MM-Poisson-Boltzmann surface area (MM-PBSA) by analyzing all the snapshots [55]. To calculate binding-free energy, we used the following formula: More positive binding energy represents stronger binding.

Identification of DEGs
To identify DEGs from each of three microarray gene-expression datasets, we used the statistical LIMMA approach through the GEO2R web tool, with the cut-off at adjusted p-value < 0.05 and |log 2 (fold change)| > 1. In the GSE106582 dataset, we identified 594 DEGs, including 213 upregulated and 381 downregulated genes. In the GSE110223 dataset, we identified 625 DEGs that contain 260 upregulated and 365 downregulated genes. In the GSE74602 dataset, we identified 1674 DEGs, including 673 upregulated and 1001 downregulated genes. The Venn diagram in Figure 2 visualizes the common DEGs among the three datasets. The Venn diagram exhibits 252 cDEGs among the three datasets.

Molecular Dynamics (MD) Simulation
We performed MD simulations of the top-orderedprotein-ligand complexes (TPX2-Manzamine A, CDC20-Cardidigin, MELK-Staurosporine, and CDK1-Riccardin D) through the YASARA software (Version: 22.8.22) [47] based on the AMBER14 force field [48]. Prior to simulation, the hydrogen-bonding network of each complex in a simulated cell was optimized using a TIP3P water model [45]. The periodic limit conditions were kept constant at 0.997 gL-1 of solvent concentration. The primary energy was minimized in each simulation by considering the steepest gradient technique with 5000 cycles. The complexes "TPX2-Manzamine A", "CDC20-Cardidigin", "MELK-Staurosporine", and "CDK1-Riccardin D" consist of a total of 56,287, 35,859, 81,347, and 45,153 atoms, respectively. At the Berendsen thermostat [49] and constant pressure, a 100 ns MD simulation was examined. Please see our previous publications for details about the MD simulation strategy [50][51][52][53]. For subsequent analysis, we took snapshots of the trajectories every 250 ps, ran them via the built-in script of YASARA [54] macro, and calculated the binding free energy of the MM-Poisson-Boltzmann surface area (MM-PBSA) by analyzing all the snapshots [55]. To calculate binding-free energy, we used the following formula: More positive binding energy represents stronger binding.

Identification of DEGs
To identify DEGs from each of three microarray gene-expression datasets, we used the statistical LIMMA approach through the GEO2R web tool, with the cut-off at adjusted p-value < 0.05 and |log2(fold change)| > 1. In the GSE106582 dataset, we identified 594 DEGs, including 213 upregulated and 381 downregulated genes. In the GSE110223 dataset, we identified 625 DEGs that contain 260 upregulated and 365 downregulated genes. In the GSE74602 dataset, we identified 1674 DEGs, including 673 upregulated and 1001 downregulated genes. The Venn diagram in Figure 2 visualizes the common DEGs among the three datasets. The Venn diagram exhibits 252 cDEGs among the three datasets.

Identification of Core Genes (CGs)
We construct the PPI network of cDEGs and visualize the PPI network to identify the potential genes most significantly associated with the development of CRC. The PPI Figure 2. Common differentially expressed genes among GSE110223, GSE106582, and GSE74602 datasets were visualized through a Venn diagram, and 252 genes were found as cDEGs from CRC patients.

Identification of Core Genes (CGs)
We construct the PPI network of cDEGs and visualize the PPI network to identify the potential genes most significantly associated with the development of CRC. The PPI network contains 216 nodes and 616 edges, with a confidence score of 0.40. Then, the MCC topology analysis method of CytoHubba was performed to calculate the top-ranked CGs within the network. We found the ten top-ranked (AURKA, TOP2A, CDK1, PTTG1, CDKN3, CDC20, MAD2L1, CKS2, MELK, and TPX2) genes (see Figure 3). These top ten CGs were identified as major controllers of CRC and considered for subsequent analysis. network contains 216 nodes and 616 edges, with a confidence score of 0.40. Then, the MCC topology analysis method of CytoHubba was performed to calculate the top-ranked CGs within the network. We found the ten top-ranked (AURKA, TOP2A, CDK1, PTTG1, CDKN3, CDC20, MAD2L1, CKS2, MELK, and TPX2) genes (see Figure 3). These top ten CGs were identified as major controllers of CRC and considered for subsequent analysis.

Association of CGs with Different Stages of CRC Progression
The box-plot analysis based on an independent database represents a high difference between normal expression and every CRC progression stage (Stage 1, Stage 2, Stage 3, and Stage 4) expression of all CGs (see Figures 4A and S1). So, our proposed CGs have strong prognostic power to identify CRC at an earlier development stage.

Prognosis Power of CGs
A survival analysis was performed to examine the prognosis power of CGs. The multivariate Kaplan-Meier survival plot of CGs expressions using the TCGA-COAD and TCGA-READ databases represents a significant difference (p-value < 0.001) between lower-risk and higher-risk groups (see Figure 4B).

Association of CGs with Different Diseases
The enrichment analysis of the CGs-set with different diseases based on the Dis-GeNET database showed that the CGs-set is significantly associated with various diseases (p-value < 0.05). Figure 3A and Table S2 show the top-ranked 20 diseases, all of which are different types of cancer, including CRC. We observed that 3 CGs (MELK, CKS2, CDC20) and 2 CGs (MELK, CDC20) do not overlap with the reference gene-sets of colon carcinoma and colorectal carcinoma, respectively (Figures 5A and S2, and Table S2). These results suggested a pan-cancer role for CGs ( Figure S3). To investigate the pan-cancer role of CGs, we also performed pan-cancer analysis based on the TCGA database. We selected the topranked 20 cancers as displayed in Figure 5B. Figure 5A,B commonly showed that eight cancers (colon adenocarcinoma, bladder urothelial carcinoma, esophageal carcinoma,

Association of CGs with Different Stages of CRC Progression
The box-plot analysis based on an independent database represents a high difference between normal expression and every CRC progression stage (Stage 1, Stage 2, Stage 3, and Stage 4) expression of all CGs (see Figures 4A and S1). So, our proposed CGs have strong prognostic power to identify CRC at an earlier development stage.

Prognosis Power of CGs
A survival analysis was performed to examine the prognosis power of CGs. The multivariate Kaplan-Meier survival plot of CGs expressions using the TCGA-COAD and TCGA-READ databases represents a significant difference (p-value < 0.001) between lowerrisk and higher-risk groups (see Figure 4B).

Association of CGs with Different Diseases
The enrichment analysis of the CGs-set with different diseases based on the Dis-GeNET database showed that the CGs-set is significantly associated with various diseases (p-value < 0.05). Figure 3A and Table S2 show the top-ranked 20 diseases, all of which are different types of cancer, including CRC. We observed that 3 CGs (MELK, CKS2, CDC20) and 2 CGs (MELK, CDC20) do not overlap with the reference gene-sets of colon carcinoma and colorectal carcinoma, respectively ( Figure 5A and Figure S2, and Table S2). These results suggested a pan-cancer role for CGs ( Figure S3). To investigate the pan-cancer role of CGs, we also performed pan-cancer analysis based on the TCGA database. We selected the top-ranked 20 cancers as displayed in Figure 5B. Figure 5A,B commonly showed that eight cancers (colon adenocarcinoma, bladder urothelial carcinoma, esophageal carcinoma, glioblastoma multiforme, liver hepatocellular carcinoma, lung adenocarcinoma, prostate adenocarcinoma, and stomach adenocarcinoma) are significantly associated with CGs. glioblastoma multiforme, liver hepatocellular carcinoma, lung adenocarcinoma, prostate adenocarcinoma, and stomach adenocarcinoma) are significantly associated with CGs.

Association of CGs with GO Terms and KEGG Pathway
Enrichment analysis of the CGs was performed using the Enrichr web tool. Table 2 shows the annotated GO terms in three categories (BPs, CCs, and MFs). In the case of biological processes (BPs), CGs were mainly involved in mitotic cell-cycle phase transition

Association of CGs with GO Terms and KEGG Pathway
Enrichment analysis of the CGs was performed using the Enrichr web tool. Table 2 shows the annotated GO terms in three categories (BPs, CCs, and MFs). In the case of biological processes (BPs), CGs were mainly involved in mitotic cell-cycle phase transition (GO:0044772), anaphase-promoting complex-dependent catabolic process (GO:0031145), regulation of G2/M transition of mitotic cell cycle (GO:0010389), mitotic spindle organi-zation (GO:0007052), regulation of mitotic cell cycle (GO:0007346). In molecular function (MFs), CGs were mainly involved in histone kinase activity (GO:0035173), RNA polymerase II CTD heptapeptide repeat kinase activity (GO:0008353), protein kinase binding (GO:0019901), CXCR chemokine receptor binding (GO:0045236), cyclin-dependent protein kinase activity (GO:0097472), etc. In cellular components (CCs), CGs were mainly involved in the spindle (GO:0005819), cyclin-dependent protein kinase holoenzyme complex (GO:0000307), serine/threonine-protein kinase complex (GO:1902554), intracellular non-membrane-bounded organelle (GO:0043232), mitotic spindle (GO:0072686), etc. The KEGG pathway enrichment analysis results for CGs were also shown in Table 2. The KEGG pathways of CGs were enriched in the cell cycle, bladder cancer, oocyte meiosis, human T-cell leukemia virus one infection, progesterone-mediated oocyte maturation, etc.

Identification of Regulatory Factors
TFs proteins and miRNAs play a fundamental role in the modification of gene expression at the transcriptional and post-transcriptional levels, respectively. To explore the major transcriptional regulatory factors of CGs, we constructed a TFs vs. CGs interaction network where round nodes with red color represent the CGs and square nodes with green/purple color represent the TFs (see Figure 6A). TFs proteins vs. CGs regulatory analysis revealed four highest-ranking significant candidate TFs modifiers (NFIC, FOXC1, YY1, and GATA2) that may regulate the expression of CGs at the transcriptional level (see Figure 6A). Similarly, we constructed an undirected interaction network of miRNAs vs. CGs to reveal the post-transcriptional regulator of CGs, where red color nodes represent the CGs and green/blue color nodes illustrate the miRNAs (see Figure 6B). The miRNAs vs. CGs regulatory network analysis revealed six highly interacted non-coding RNAs (miRNAs) such as hsa-mir-147a, hsa-mir-129-2-3p, hsa-mir-124-3p, hsa-mir-34a-5p, hsa-mir-23b-3p, and hsa-mir-16-5p that act as gene expression regulators at the post-transcriptional level (see Figure 6B). So, those identified TFs and miRNAs may influence the gene expression of CGs at the transcriptional and post-transcriptional levels, respectively.
analysis revealed four highest-ranking significant candidate TFs modifiers (NFIC, FOXC1, YY1, and GATA2) that may regulate the expression of CGs at the transcriptional level (see Figure 6A). Similarly, we constructed an undirected interaction network of miRNAs vs. CGs to reveal the post-transcriptional regulator of CGs, where red color nodes represent the CGs and green/blue color nodes illustrate the miRNAs (see Figure 6B). The miRNAs vs. CGs regulatory network analysis revealed six highly interacted non-coding RNAs (miRNAs) such as hsa-mir-147a, hsa-mir-129-2-3p, hsa-mir-124-3p, hsa-mir-34a-5p, hsamir-23b-3p, and hsa-mir-16-5p that act as gene expression regulators at the post-transcriptional level (see Figure 6B). So, those identified TFs and miRNAs may influence the gene expression of CGs at the transcriptional and post-transcriptional levels, respectively.

Drug Repurposing through Molecular Docking Studies
We considered 10 CG-guided proteins and 4 TFs proteins as target-receptor proteins for molecular docking. A structural interaction was carried out between target-receptor proteins and 158 drug agents by molecular docking studies, which computed the receptordrug binding affinities (BA) for each interaction (see Figure 7).

Drug Repurposing through Molecular Docking Studies
We considered 10 CG-guided proteins and 4 TFs proteins as target-receptor proteins for molecular docking. A structural interaction was carried out between target-receptor proteins and 158 drug agents by molecular docking studies, which computed the receptordrug binding affinities (BA) for each interaction (see Figure 7). The red colors text in the row represents proposed drug agents.

Molecular Dynamic (MD) Simulations
TPX2-Manzamine A, CDC20-Cardidigin, MELK-Staurosporine, and CDK1-Riccardin D complexes have shown the highest BA in molecular docking analysis (Table 3). So, we considered those complexes for examining their binding stability through MD simulations. We observed that each protein-ligand complex (TPX2-Manzamine A, CDC20-Cardidigin, MELK-Staurosporine, and CDK1-Riccardin D) showed significant stability in a 100 ns MM-PSSA simulation (see Figure 8A). RMSD values corresponding to each complex were calculated (see Figure 8A). RMSD values showed lower flexibility around 1.5 Å to 3.0 Å for all four complexes. The average RMSD values for TPX2-Manzamine A, CDC20-Cardidigin, MELK-Staurosporine, and CDK1-Riccardin D complexes were 2.592 Å, 1.724 Å, 2.235 Å, and 2.516 Å, respectively. The CDC20-Cardigin complex showed a more substantial structural rigidity than the other three complexes, gained equilibrium at four ns, and displayed good stability after that. MELK-Staurosporine showed a gradual increase in RMSD before 22 ns, and after this time point, the RMSD score of the complex illustrated almost stable movement between 2.2 Å and 2.50 Å for 68 ns. After that, there were irregular fluctuations in the RMSD. On the contrary, CDK1-Riccardin D complexes exhibited instability, and the RMSD displayed an upward trend from 2.0 Å to 3.2 Å over time. Similarly, TPX2-Manzamine A showed irregular oscillation in RMSD between 1.7 Å and 3.3 Å. In addition, the MM-PBSA binding energy for four complexes was also computed. Figure 8B   Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D were shown to have strong BA against all of the target receptors, and their average BA lies between −9.20 and −8.2 (kcal mol −1 ). Among those drugs, Manzamine A was shown to have the highest BA against almost every target protein, with an average BA of −9.2 kcal mol −1 . Therefore, we proposed those seven drugs (Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D) as candidate drug agents and displayed them in red color in Figure 7. We also revealed the structural interaction profiles of four top-ordered receptor proteins and drug complexes in Table 3.

Molecular Dynamic (MD) Simulations
TPX2-Manzamine A, CDC20-Cardidigin, MELK-Staurosporine, and CDK1-Riccardin D complexes have shown the highest BA in molecular docking analysis (Table 3). So, we considered those complexes for examining their binding stability through MD simulations. We observed that each protein-ligand complex (TPX2-Manzamine A, CDC20-Cardidigin, MELK-Staurosporine, and CDK1-Riccardin D) showed significant stability in a 100 ns MM-PSSA simulation (see Figure 8A). RMSD values corresponding to each complex were calculated (see Figure 8A). RMSD values showed lower flexibility around 1.5 Å to 3.0 Å for all four complexes. The average RMSD values for TPX2-Manzamine A, CDC20-Cardidigin, MELK-Staurosporine, and CDK1-Riccardin D complexes were 2.592 Å, 1.724 Å, 2.235 Å, and 2.516 Å, respectively. The CDC20-Cardigin complex showed a more substantial structural rigidity than the other three complexes, gained equilibrium at four ns, and displayed good stability after that. MELK-Staurosporine showed a gradual increase in RMSD before 22 ns, and after this time point, the RMSD score of the complex illustrated almost stable movement between 2.2 Å and 2.50 Å for 68 ns. After that, there were irregular fluctuations in the RMSD. On the contrary, CDK1-Riccardin D complexes exhibited instability, and the RMSD displayed an upward trend from 2.0 Å to 3.2 Å over time. Similarly, TPX2-Manzamine A showed irregular oscillation in RMSD between 1.7 Å and 3.3 Å. In addition, the MM-PBSA binding energy for four complexes was also computed. Figure 8B illustrates the binding energies of the complexes. On average, TPX2-Manzamine A, CDC20-Cardidigin, MELK-Staurosporine, and CDK1-Riccardin D complexes produced Table 3. The 1st, 2nd, 3rd column show potential targets, 2-dimentional(2d) structure of lead compounds, top ordered binding affinities (kcal mol −1 ), respectively. The 3-dimension(3d) view of top ranking drug-target complexes is shown in the 4th column. Finally, the last column shows key elements of interacting amino acids, including hydrogen bond, hydrophobic interactions, and electrostatic.

Structure of Top Compounds
Binding Affinity Score (kcal mol −1 )

3D Structures of Complex with Interactions
Interacting Amino Acids Hydrogen Bond

TPX2
Manzamine A Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D were shown to have strong BA against all of the target receptors, and their average BA lies between −9.20 and −8.2 (kcal mol −1 ). Among those drugs, Manzamine A was shown to have the highest BA against almost every target protein, with an average BA of −9.2 kcal mol −1 . Therefore, we proposed those seven drugs (Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D) as candidate drug agents and displayed them in red color in Figure 7. We also revealed the structural interaction profiles of four top-ordered receptor proteins and drug complexes in Table 3. Table 3. The 1st, 2nd, 3rd column show potential targets, 2-dimentional(2d) structure of lead compounds, top ordered binding affinities (kcal mol −1 ), respectively. The 3-dimension(3d) view of top ranking drug-target complexes is shown in the 4th column. Finally, the last column shows key elements of interacting amino acids, including hydrogen bond, hydrophobic interactions, and electrostatic.

Structure of Top Compounds
Binding Affinity Score (kcal mol −1 )

3D Structures of Complex with Interactions
Interacting Amino Acids Hydrogen Bond Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D were shown to have strong BA against all of the target receptors, and their average BA lies between −9.20 and −8.2 (kcal mol −1 ). Among those drugs, Manzamine A was shown to have the highest BA against almost every target protein, with an average BA of −9.2 kcal mol −1 . Therefore, we proposed those seven drugs (Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D) as candidate drug agents and displayed them in red color in Figure 7. We also revealed the structural interaction profiles of four top-ordered receptor proteins and drug complexes in Table 3. Table 3. The 1st, 2nd, 3rd column show potential targets, 2-dimentional(2d) structure of lead compounds, top ordered binding affinities (kcal mol −1 ), respectively. The 3-dimension(3d) view of top ranking drug-target complexes is shown in the 4th column. Finally, the last column shows key elements of interacting amino acids, including hydrogen bond, hydrophobic interactions, and electrostatic.

Structure of Top Compounds
Binding Affinity Score (kcal mol −1 )

3D Structures of Complex with Interactions
Interacting Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D were shown to have strong BA against all of the target receptors, and their average BA lies between −9.20 and −8.2 (kcal mol −1 ). Among those drugs, Manzamine A was shown to have the highest BA against almost every target protein, with an average BA of −9.2 kcal mol −1 . Therefore, we proposed those seven drugs (Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D) as candidate drug agents and displayed them in red color in Figure 7. We also revealed the structural interaction profiles of four top-ordered receptor proteins and drug complexes in Table 3. Table 3. The 1st, 2nd, 3rd column show potential targets, 2-dimentional(2d) structure of lead compounds, top ordered binding affinities (kcal mol −1 ), respectively. The 3-dimension(3d) view of top ranking drug-target complexes is shown in the 4th column. Finally, the last column shows key elements of interacting amino acids, including hydrogen bond, hydrophobic interactions, and electrostatic.

Structure of Top Compounds
Binding Affinity Score (kcal mol −1 )

3D Structures of Complex with Interactions
Interacting Amino Acids Hydrogen Bond Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D were shown to have strong BA against all of the target receptors, and their average BA lies between −9.20 and −8.2 (kcal mol −1 ). Among those drugs, Manzamine A was shown to have the highest BA against almost every target protein, with an average BA of −9.2 kcal mol −1 . Therefore, we proposed those seven drugs (Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D) as candidate drug agents and displayed them in red color in Figure 7. We also revealed the structural interaction profiles of four top-ordered receptor proteins and drug complexes in Table 3. Table 3. The 1st, 2nd, 3rd column show potential targets, 2-dimentional(2d) structure of lead compounds, top ordered binding affinities (kcal mol −1 ), respectively. The 3-dimension(3d) view of top ranking drug-target complexes is shown in the 4th column. Finally, the last column shows key elements of interacting amino acids, including hydrogen bond, hydrophobic interactions, and electrostatic.

Structure of Top Compounds
Binding Affinity Score (kcal mol −1 )

3D Structures of Complex with Interactions
Interacting Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D were shown to have strong BA against all of the target receptors, and their average BA lies between −9.20 and −8.2 (kcal mol −1 ). Among those drugs, Manzamine A was shown to have the highest BA against almost every target protein, with an average BA of −9.2 kcal mol −1 . Therefore, we proposed those seven drugs (Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D) as candidate drug agents and displayed them in red color in Figure 7. We also revealed the structural interaction profiles of four top-ordered receptor proteins and drug complexes in Table 3. Table 3. The 1st, 2nd, 3rd column show potential targets, 2-dimentional(2d) structure of lead compounds, top ordered binding affinities (kcal mol −1 ), respectively. The 3-dimension(3d) view of top ranking drug-target complexes is shown in the 4th column. Finally, the last column shows key elements of interacting amino acids, including hydrogen bond, hydrophobic interactions, and electrostatic.

Structure of Top Compounds
Binding Affinity Score (kcal mol −1 )

3D Structures of Complex with Interactions
Interacting Amino Acids Hydrogen Bond Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D were shown to have strong BA against all of the target receptors, and their average BA lies between −9.20 and −8.2 (kcal mol −1 ). Among those drugs, Manzamine A was shown to have the highest BA against almost every target protein, with an average BA of −9.2 kcal mol −1 . Therefore, we proposed those seven drugs (Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D) as candidate drug agents and displayed them in red color in Figure 7. We also revealed the structural interaction profiles of four top-ordered receptor proteins and drug complexes in Table 3. Table 3. The 1st, 2nd, 3rd column show potential targets, 2-dimentional(2d) structure of lead compounds, top ordered binding affinities (kcal mol −1 ), respectively. The 3-dimension(3d) view of top ranking drug-target complexes is shown in the 4th column. Finally, the last column shows key elements of interacting amino acids, including hydrogen bond, hydrophobic interactions, and electrostatic.

Structure of Top Compounds
Binding Affinity Score (kcal mol −1 )

3D Structures of Complex with Interactions
Interacting Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D were shown to have strong BA against all of the target receptors, and their average BA lies between −9.20 and −8.2 (kcal mol −1 ). Among those drugs, Manzamine A was shown to have the highest BA against almost every target protein, with an average BA of −9.2 kcal mol −1 . Therefore, we proposed those seven drugs (Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D) as candidate drug agents and displayed them in red color in Figure 7. We also revealed the structural interaction profiles of four top-ordered receptor proteins and drug complexes in Table 3. Table 3. The 1st, 2nd, 3rd column show potential targets, 2-dimentional(2d) structure of lead compounds, top ordered binding affinities (kcal mol −1 ), respectively. The 3-dimension(3d) view of top ranking drug-target complexes is shown in the 4th column. Finally, the last column shows key elements of interacting amino acids, including hydrogen bond, hydrophobic interactions, and electrostatic.

Structure of Top Compounds
Binding Affinity Score (kcal mol −1 )

3D Structures of Complex with Interactions
Interacting Amino Acids Hydrogen Bond Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D were shown to have strong BA against all of the target receptors, and their average BA lies between −9.20 and −8.2 (kcal mol −1 ). Among those drugs, Manzamine A was shown to have the highest BA against almost every target protein, with an average BA of −9.2 kcal mol −1 . Therefore, we proposed those seven drugs (Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D) as candidate drug agents and displayed them in red color in Figure 7. We also revealed the structural interaction profiles of four top-ordered receptor proteins and drug complexes in Table 3. Table 3. The 1st, 2nd, 3rd column show potential targets, 2-dimentional(2d) structure of lead compounds, top ordered binding affinities (kcal mol −1 ), respectively. The 3-dimension(3d) view of top ranking drug-target complexes is shown in the 4th column. Finally, the last column shows key elements of interacting amino acids, including hydrogen bond, hydrophobic interactions, and electrostatic.

Structure of Top Compounds
Binding Affinity Score (kcal mol −1 )

3D Structures of Complex with Interactions
Interacting

Discussion
To explore core genomic biomarkers and their mechanisms in the CRC progression, firstly, we identified 252 cDEGs between CRC and control samples, out of around 40,000 genes in three gene expression datasets of the NCBI-GEO database with accession numbers GSE106582, GSE110223, and GSE74602. Though the number of DEGs is much smaller than the total number of genes, it may still be a large number for further investigation by wet-lab experiments since it would be laborious, time-consuming, and costly. Therefore, a smaller set of DEGs that are representative of all DEGs is required to reduce time, cost, and labor during further experiments by the wet-lab researchers. Though total DEGs are more informative than any smaller set of DEGs, a smaller representative set of DEGs would be more beneficial from the viewpoints of time, cost, and labor. Therefore, in this study, we proposed 10 top-ranked DEGs (AURKA, TOP2A, CDK1, PTTG1, CDKN3, CDC20, MAD2L1, CKS2, MELK, and TPX2) as the core genes (CGs) for early diagnosis, prognosis, and therapies of CRC. The survival probability curves and box-plot analyses

Discussion
To explore core genomic biomarkers and their mechanisms in the CRC progression, firstly, we identified 252 cDEGs between CRC and control samples, out of around 40,000 genes in three gene expression datasets of the NCBI-GEO database with accession numbers GSE106582, GSE110223, and GSE74602. Though the number of DEGs is much smaller than the total number of genes, it may still be a large number for further investigation by wet-lab experiments since it would be laborious, time-consuming, and costly. Therefore, a smaller set of DEGs that are representative of all DEGs is required to reduce time, cost, and labor during further experiments by the wet-lab researchers. Though total DEGs are more informative than any smaller set of DEGs, a smaller representative set of DEGs would be more beneficial from the viewpoints of time, cost, and labor. Therefore, in this study, we proposed 10 top-ranked DEGs (AURKA, TOP2A, CDK1, PTTG1, CDKN3, CDC20, MAD2L1, CKS2, MELK, and TPX2) as the core genes (CGs) for early diagnosis, prognosis, and therapies of CRC. The survival probability curves and box-plot analyses with the expressions of CGs in different stages (control, Stage 1, Stage 2, Stage 3, and Stage 4) of CRC with the TCGA database indicated their strong prognostic performance from the earlier stages of the disease. It should be mentioned here that Patil et al. (2021) [22] identified CRC-causing 40 CGs for early diagnosis, which might be a large number for further investigation by the wet-lab researchers since it would be laborious, time-consuming, and costly, as mentioned earlier. In our proposed 10 CGs, 9 genes (AURKA, TOP2A, CDK1, PTTG1, CDKN3, CDC20, MAD2L1, MELK, and TPX2) were overlapped/common with their 40 CGs. In addition, we also investigated the association of our proposed CGs with different diseases. We found they have a strong pan-cancer role, including CRC. The literature review also supported the association of our proposed CGs with CRC. For example, AU-RKA is considered an oncogene that significantly impacts the proliferation and progression of colorectal carcinoma from colorectal adenoma [56]. Generally, AURKA is overexpressed and amplified in CRCs [57][58][59][60][61][62]. TOP2A is highly expressed during tumor development and responds to drug therapy for CRC [63]. CDK1 is also overexpressed and sensitive to apoptosis in CRC cells [64]. CDKN3 is highly expressed in CRC tissues and remarkably related to patients' diagnoses [65]. CKS2 overexpression is correlated with aggressive tumor development in CRC, meaning that CKS2 might function as a decent CRC biomarker [66]. CKS2 is a promising biomarker contributing to CRC tumor development [66]. CDC20, PTTG1, and MAD2L1 might be CRC stage-related genes [67]. MELK might play a role as an effective therapeutic target for CRC [68]. TPX2 is highly upregulated in CRC tissues [69].
We identified the top-ranked five GO terms and KEGG pathways of CGs to reveal their molecular mechanisms in CRC progression. The identified GO term 'cell cycle' is one of the most important biological processes in the human body [70]. It has four sequential phases. Arguably, the most important phases are the S phase (DNA replication occurs) and the M phase (cell divides into two daughter cells) [71]. The fundamental task of the cell cycle is to ensure that DNA is faithfully replicated once during S phase and that identical chromosomal copies are distributed equally to two daughter cells during M phase. Usually, in adult tissue, there is a delicate balance between cell death and proliferation (cell division), producing a steady state. Disruption of this equilibrium by loss of cell cycle control may eventually lead to tumor development [72], including colorectal cancer [73], colon cancer [74], liver cancer [75], glioblastoma [76], breast cancer [77,78], lung cancer [79,80], gastric cancer [81], etc. So, the cell cycle is considered a vital cancer progression process. TOP2A is related to tumor development and poor survival outcomes by regulating cell proliferation and the CRC cell cycle [82,83]. CDK1 controls the cell cycle and aids in the development of colorectal tumors via an iron-regulated signalling axis [64]. In most CRCs, chromosomal variability resulting in an abnormal chromosome number, aneuploidy, was systematically related to a mitotic checkpoint's loss of function [84,85]. The GO term 'mitotic spindle orientation' can influence tissue organization and control the placement of daughter cells within a tissue. Spindle misorientation greatly affects cancer development and progression [86], including CRC [87]. The top-ranked five MFs (cyclin-dependent protein kinase activity, CXCR chemokine receptor binding, protein kinase binding, RNA polymerase II CTD heptapeptide repeat kinase activity, and histone kinase activity) play a vital role in CRC development and proliferation [12,[88][89][90]. Similarly, the enriched five CCs, including the spindle, mitotic spindle, cyclin-dependent protein kinase holoenzyme complex, intracellular non-membrane-bounded organelle, and serine/threonine-protein kinase complex, are strongly related to the progression of CRC [91][92][93][94][95]. Chromosomal instability happens in 80%-85% of CRCs and is considered the most common subtype of CRC [94]. Subsequent research showed that chromosomal instability is caused by mutations in the genes that govern the mitotic spindle checkpoint [93]. The consecutive activation of a group of serine-threonine kinases controls eukaryotic cell-cycle checkpoints [95]. We identified the top 5 enriched common KEGG pathways (cell cycle, bladder cancer [96], oocyte meiosis [22], human T-cell leukemia virus 1 infection, progesterone-mediated oocyte maturation) that are also reported by some other studies [22,96,97].
We also identified key regulators of CGs, such as four TFs (NFIC, FOXC1, YY1, and GATA2) and six miRNAs that played a significant role in CRC development. FOXC1, a member of the forkhead box family, has been connected to the growth and progression of numerous diseases [98,99], particularly CRC [100,101]. YY1 is a multipurpose TF protein that can stimulate or suppress gene expression [102] and plays a significant role in CRC tumor growth [103]. In CRC, the high-level expressions of GATA2 were linked with a poor prognosis and recurrence in solid tumors [104]. Nuclear factor 1 C-type (NFIC) regulates PFKB3 in response to CRC [105].

Conclusions
The present study identified the 10 top-ranked DEGs (AURKA, TOP2A, CDK1, PTTG1, CDKN3, CDC20, MAD2L1, CKS2, MELK, and TPX2) as the core genes (CGs), which showed Strong prognostic performance in the earlier stages of CRC. The CGs-disease and pan-cancer analysis commonly showed that CGs have a robust pan-cancer role, including CRC, bladder urothelial carcinoma, esophageal carcinoma, glioblastoma multiforme, liver hepatocellular carcinoma, lung adenocarcinoma, prostate adenocarcinoma, and stomach adenocarcinoma. The CGs regulatory network analysis detected some essential TFs proteins (NFIC, FOXC1, YY1, and GATA2) and miRNAs (hsa-mir-147a, hsa-mir-129-2-3p, hsa-mir-124-3p, hsa-mir-34a-5p, hsa-mir-23b-3p, and hsa-mir-16-5p) as the transcriptional and post-transcriptional regulators of CGs. The enrichment analysis also revealed some important CRC-causing GO terms and signaling pathways. For example, cell cycle and mitotic spindle pathways have a significant association with CRC progression. Finally, we recommended our proposed CGsguided 7 candidate drug molecules (Manzamine A, Cardidigin, Staurosporine, Sitosterol, Benzo[a]pyrene, Nocardiopsis sp., and Riccardin D) for the treatment against the CRC by molecular docking analysis. Thus, the findings of this study may be more useful compared to the previous computation study results for early diagnosis, prognosis, and therapies forf CRC.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/cancers15051369/s1, Figure S1: Box plots for the expressions of KGs with different stages (Stage 1, Stage 2, Stage 3, and Stage 4) of READ (Rectal adenocarcinoma), including the control stage in the TCGA database. Figure S2: Core gene-set enrichment analysis with different diseases. Figure S3 Table  S1: Transcriptome-guided 158 meta-drug agents associated with CRC infections collected from the DSigDB online database. Table S2: The list of the top 20 significant (p-value < 0.05) comorbidities associated with CGs.