An Open, Harmonized Genomic Meta-Database Enabling AI-Based Personalization of Adjuvant Chemotherapy in Early-Stage Non-Small Cell Lung Cancer
Abstract
1. Introduction
2. Methodology
2.1. Study Identification and LLM Screening
2.2. Inclusion and Exclusion Criteria
2.3. Preprocessing of Metadata
2.3.1. Clinical Data Preparation
2.3.2. Preprocessing of Gene-Expression Data
2.3.3. Batch Effect Assessment, Correction, and Quality Control
2.4. Graphical Summary
3. Results
4. Discussion
Strengths, Limitations, and Future Directions
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Platform | GEO Series | Patients (Total/ACT) | Probe Sets (Raw) |
---|---|---|---|
GPL570 | GSE29013; GSE37745; GSE31908; GSE31210 a; GSE50081 a; GSE157010 a | 788/65 | 54,675 |
GPL96 | GSE68465; GSE14814 | 559/159 | 22,283 |
| Feature | Level | Cohort (n = 1361) |
|---|---|---|
| Age, years | Median | 65 |
| | Range | 30–89 |
| | Mean | 65 |
| Sex, n (%) | Female | 602 (44.9) |
| | Male | 738 (55.1) |
| Race, n (%) | Caucasian | 354 (26.4) |
| | African American | 14 (1.0) |
| | Asian | 7 (0.5) |
| | Native Hawaiian | 1 (0.1) |
| | 1 Unknown | 964 (71.9) |
| Stage, n (%) | IA | 407 (30.4) |
| | IB | 462 (34.5) |
| | II | 343 (25.6) |
| | III | 128 (9.5) |
| Histology, n (%) | Adenocarcinoma | 923 (68.9) |
| | Squamous Cell Carcinoma | 384 (28.7) |
| | Large Cell Carcinoma | 31 (2.3) |
| | Adenosquamous Carcinoma | 2 (0.1) |
| n (%) | Yes | 588 (43.9) |
| | No | 182 (13.6) |
| | Unknown | 570 (42.5) |
| ACT, n (%) | Yes | 223 (16.6) |
| | No | 1117 (83.4) |
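The percentages in the cohort table can be reproduced from the counts alone. The sketch below is illustrative only: it copies three groups of counts from the table (the Female/Male rows, the histology rows, and the ACT rows), and it assumes each percentage is the count divided by the sum of the counts in its own group, rounded to one decimal place. The dictionary keys and the `percentages` helper are hypothetical names introduced here, not part of the published pipeline.

```python
# Counts copied from the cohort table; group keys are illustrative labels.
counts = {
    "female_male": {"Female": 602, "Male": 738},
    "histology": {
        "Adenocarcinoma": 923,
        "Squamous Cell Carcinoma": 384,
        "Large Cell Carcinoma": 31,
        "Adenosquamous Carcinoma": 2,
    },
    "act": {"Yes": 223, "No": 1117},
}


def percentages(levels):
    """Return each level's share of the group total, in percent (1 decimal)."""
    total = sum(levels.values())
    return {name: round(100 * n / total, 1) for name, n in levels.items()}


for group, levels in counts.items():
    # Print the group total (the assumed denominator) and the derived shares.
    print(group, sum(levels.values()), percentages(levels))
```

Under this assumption, each of the three groups sums to 1340 patients, and the derived shares agree with the tabulated values for these rows (for example, 602 of 1340 gives 44.9% and 223 of 1340 gives 16.6%).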
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Moon, H.; Cheuk, M.Y.; Sun, O.; Lee, K.; Kim, G.; Kwak, K.; Kwak, K.; Tam, A.C. An Open, Harmonized Genomic Meta-Database Enabling AI-Based Personalization of Adjuvant Chemotherapy in Early-Stage Non-Small Cell Lung Cancer. Appl. Sci. 2025, 15, 10733. https://doi.org/10.3390/app151910733