Gene Expression-Based Colorectal Cancer Prediction Using Machine Learning and SHAP Analysis
Abstract
1. Introduction
2. Materials
3. Methods
3.1. Differential Gene Expression Analysis of the TCGA Database
3.2. Mendelian Randomization Analysis of eQTL Genomics and Colorectal Cancer Outcomes
3.3. Identification of Common Genes
3.4. Development and Validation of the Colorectal Cancer Genetic Diagnostic Model Using Nine Machine Learning Methods Based on GEO Database
3.5. SHAP Explanation of Machine Learning Models
4. Results
4.1. Gene Expression in Colorectal Cancer: Trends Toward Cell Proliferation, High Metabolism, and Immune Escape
4.2. Differential Expression of Intersection Genes from TCGA and MR Analysis Focused on Cell Proliferation
4.3. Development and Validation of the Colorectal Cancer Genetic Diagnostic Model Using Nine Machine Learning Methods
4.4. SHAP Explanation of the XGBoost Model
5. Discussion
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest



© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yin, Y.; Yang, Z.; Li, X.; Gong, S.; Xu, C. Gene Expression-Based Colorectal Cancer Prediction Using Machine Learning and SHAP Analysis. Genes 2026, 17, 114. https://doi.org/10.3390/genes17010114

