An Open, Harmonized Genomic Meta-Database Enabling AI-Based Personalization of Adjuvant Chemotherapy in Early-Stage Non-Small Cell Lung Cancer

Moon, Hojin; Cheuk, Michelle Y.; Sun, Owen; Lee, Katherine; Kim, Gyumin; Kwak, Kaden; Kwak, Koeun; Tam, Aaron C.

doi:10.3390/app151910733


by Hojin Moon ¹,*, Michelle Y. Cheuk ¹, Owen Sun ², Katherine Lee ¹, Gyumin Kim ¹, Kaden Kwak ¹, Koeun Kwak ¹ and Aaron C. Tam ¹

¹ Department of Mathematics and Statistics, California State University, 1250 Bellflower Blvd., Long Beach, CA 90840, USA

² California Academy of Mathematics and Science, 1000 E. Victoria St., Carson, CA 90747, USA

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(19), 10733; https://doi.org/10.3390/app151910733

Submission received: 2 September 2025 / Revised: 28 September 2025 / Accepted: 2 October 2025 / Published: 5 October 2025

(This article belongs to the Special Issue Generative Artificial Intelligence for Clinical Decision Support System and Healthcare)


Abstract

Background: Personalizing adjuvant chemotherapy (ACT) after curative resection in early-stage NSCLC remains an unmet need because prior ACT-biomarker findings rarely reproduce across studies. Key barriers are platform and preprocessing heterogeneity, dominant batch effects, and incomplete ACT annotations. As a result, many signatures that perform well in a single cohort fail during external validation. We created an open, harmonized meta-database linking gene expression with curated ACT exposure and survival to enable fair benchmarking and modeling. Methods: A PRISMA-guided search of 999 GEO studies (through January 2025) used LLM-assisted triage of titles, clinical tables, and free text to identify datasets with explicit ACT status and patient-level survival. Eight Affymetrix microarray cohorts (GPL570/GPL96) met eligibility. Raw CEL files underwent robust multi-array average (RMA) normalization; probes were re-annotated to Entrez IDs and collapsed by median. Covariate-preserving ComBat adjusted for platform and study effects while retaining key clinical factors. Batch structure was quantified by principal-component analysis (PCA) variance, silhouette width, and UMAP. Two quality-control (QC) filters, median M-score deviation and PCA leverage, flagged and removed technical outliers. Results: The final meta-database comprises 1340 patients (223 (16.6%) ACT; 1117 (83.4%) observation), 13,039 intersecting genes, and 594 overall-survival events. Batch-associated variance (PC1 + PC2) decreased from 63.1% to 20.1%, and mean silhouette width shifted from 0.82 to −0.19 post-correction. Seven arrays (0.5%) were excluded by QC. Event depth supports high-dimensional survival and heterogeneity-of-treatment modeling, and the multi-cohort design enables internal–external validation. Conclusions: This first open, rigorously harmonized NSCLC transcriptomic database provides the sample size, demographic diversity, and technical consistency required to benchmark ACT-benefit markers. By making these data openly available, it will accelerate equitable precision-oncology research and enable data-driven treatment decisions in early-stage NSCLC.

Keywords:

bioinformatics; chemotherapy; GEO datasets; NSCLC; personalized medicine

1. Introduction

Lung cancer remains the leading cause of oncologic death, with non-small cell lung cancer (NSCLC) accounting for roughly 85% of cases [1,2]. Despite curative resection, five-year overall survival seldom exceeds 55% in stage II–III NSCLC [3,4]. Randomized clinical trials such as IALT, JBR.10, and ANITA demonstrated a modest 4–5% absolute survival benefit at five years from cisplatin-based adjuvant chemotherapy (ACT) [5,6,7]. However, many patients experience substantial toxicity without clinical gain. Consequently, practice guidelines now recognize the urgent need for reliable biomarkers to identify individuals most likely to benefit from ACT [8].

Despite randomized trials showing a modest average survival benefit from cisplatin-based ACT, many patients incur toxicity without gain, underscoring the need for biomarkers that identify likely beneficiaries. Yet a decade of ACT-biomarker efforts has not yielded clinically reliable tools. The persistent barrier is data inconsistency: public NSCLC transcriptomes span multiple array chemistries and study pipelines, batch effects overwhelm biological variation, and ACT metadata are often absent or ambiguous. Furthermore, platform heterogeneity and non-uniform preprocessing introduce spurious differences that masquerade as biology. Consequently, signatures trained in one dataset commonly fail to generalize.

Large, randomized trials provided invaluable biospecimens that sparked a wave of genomic investigations into ACT responsiveness [5,6,7,9]. Dozens of expression signatures (e.g., 15-gene prognostic score [10], lung metagene models [11,12,13,14]) show promise, but reproducibility and generalizability are limited by small cohorts, diverse microarray platforms, and inconsistent metadata [15]. Meta-analyses integrating publicly available microarray studies have attempted to mitigate these limitations; however, most studies have focused on prognosis alone and have frequently excluded key treatment details, including ACT status [16,17].

Prior meta-analyses have improved prognostic modeling, but most did not include or harmonize ACT exposure; thus, the field lacks an open, rigorously processed benchmark that links gene expression, ACT status, and survival at the patient level across cohorts. Without such a foundation, investigators cannot fairly compare methods, quantify transportability, or perform internal–external validation needed for treatment-effect modeling.

Public repositories house more than 50,000 NSCLC transcriptomes, but batch effects and missing ACT annotations impede reuse. Integrating these heterogeneous resources can dramatically expand statistical power and improve representation of diverse patient populations. Yet genuine data harmonization is non-trivial: studies differ in array platform chemistry (Affymetrix HG-U133 Plus 2.0, HG-U133A, Illumina BeadChip, among others), preprocessing pipelines, gene-probe mappings, and variable nomenclature. Recent advances, namely robust multi-array average (RMA) normalization [18], probe re-annotation frameworks [19], and empirical Bayes batch correction (ComBat) [20], enable large-scale integration, while large language models (LLMs) now expedite study selection.

Integrating transcriptomes from geographically diverse clinical trials expands the demographic breadth of discovery datasets, thereby reducing the bias inherent in single-institution series and increasing the likelihood that derived biomarkers will perform reliably across populations [21]. Moreover, the enlarged sample size facilitates robust training and independent validation of modern machine-learning algorithms that require hundreds of events to avoid over-fitting [22].

Despite these advances, no open-source meta-database has yet combined both gene expression and curated ACT information together with harmonized survival outcomes. In this paper, we therefore constructed and quality-controlled the first open genomic meta-database that unites gene expression profiles with harmonized ACT exposure and survival outcomes, providing a robust platform for discovering ACT-related biomarkers in early-stage NSCLC.

Our goal in this work is therefore to solve the data layer: to construct a high-fidelity, open-access meta-database that (i) curates explicit patient-level ACT annotations, (ii) harmonizes clinical covariates and survival across eight Affymetrix cohorts, (iii) performs a transparent, reproducible raw-to-matrix pipeline (RMA, re-annotation, intersection genes, covariate-preserving ComBat, multi-metric QC), and (iv) provides a sufficiently large, technically consistent resource—thereby enabling reproducible benchmarking and heterogeneity-of-treatment-effect (HTE) biomarker discovery. This work contributes the first open, harmonized transcriptomic meta-database for early-stage NSCLC that links gene expression, ACT exposure, and survival.

2. Methodology

In this section, we detail the LLM-assisted selection logic, preprocessing pipeline, harmonization procedures, and quality-control methods underpinning the meta-database.

2.1. Study Identification and LLM Screening

LLMs are starting to be applied in wide-ranging biomedical applications, including in information extraction and retrieval [23]. In our recent study, we demonstrated that LLM-aided study screening can shorten the days-to-weeks manual curation of GEO records to mere hours by automatically extracting key metadata from free text and clinical data tables [24]. We identified 999 GEO records; LLM triage retained 49 for dual human review; full-text/metadata screening yielded 8 eligible cohorts meeting all criteria.

A PRISMA (Preferred Reporting Items for Systematic Review and Meta-Analysis) [25] compliant workflow (Figure 1) identified 999 records in GEO search. LLM triage excluded 950, retaining 49 for dual human review. Human review excluded 18 for missing patient-level overall survival or for lacking both ≥50% stage I and explicit ACT annotation, leaving 31 datasets for report/metadata retrieval. One dataset without ACT/report was excluded, leaving 30 reports. At full-text/metadata screening we excluded 18 non-Affymetrix platforms, 1 with no raw CEL files, 1 with unknown ACT status, and 2 duplicates, yielding 8 eligible NSCLC cohorts in GEO.
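The screening counts above can be checked with a short tally. This is only an arithmetic sketch that replays the exclusion counts reported in the text; the function name `prisma_flow` and the stage labels are illustrative and not part of the authors' pipeline.

```python
def prisma_flow(identified, exclusions):
    """Apply each (stage, n_excluded) step in order and return the running totals."""
    remaining = identified
    trail = [("identified", remaining)]
    for stage, n_excluded in exclusions:
        remaining -= n_excluded
        if remaining < 0:
            raise ValueError(f"more records excluded than remain at stage: {stage}")
        trail.append((stage, remaining))
    return trail


# Exclusion counts as reported in the text (stage labels are paraphrased).
steps = [
    ("LLM triage", 950),
    ("dual human review", 18),
    ("no ACT annotation or report", 1),
    ("non-Affymetrix platform", 18),
    ("no raw CEL files", 1),
    ("unknown ACT status", 1),
    ("duplicate GEO datasets", 2),
]

trail = prisma_flow(999, steps)
for stage, remaining in trail:
    print(f"{stage:30s} {remaining:4d}")
```

The running totals pass through 49, 31, and 30 records before ending at the 8 eligible cohorts.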

To identify as many relevant NSCLC studies in GEO as possible for study selection, we utilized a broad search query, searching for all publicly available microarray studies through January 2025 containing any of the following keywords: “lung adenocarcinomas,” “lung carcinomas,” “lung squamous cell carcinomas,” “large cell carcinomas,” “bronchioalveolar carcinomas,” “pulmonary adenocarcinomas,” “lung adenocarcinoma,” “lung carcinoma,” “lung squamous cell carcinoma,” “large cell carcinoma,” “bronchioalveolar carcinoma,” “pulmonary adenocarcinoma,” “NSCLC,” or “non-small cell lung cancer.” The search identified 999 records in GEO.
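As a minimal sketch of how such a broad query might be assembled, the snippet below joins the listed keywords into a single OR expression. The exact query string, field tags, and date or dataset-type filters used by the authors are not given in the text, so `build_geo_query` and its output format are assumptions for illustration only.

```python
# Keywords copied from the search description above.
KEYWORDS = [
    "lung adenocarcinomas",
    "lung carcinomas",
    "lung squamous cell carcinomas",
    "large cell carcinomas",
    "bronchioalveolar carcinomas",
    "pulmonary adenocarcinomas",
    "lung adenocarcinoma",
    "lung carcinoma",
    "lung squamous cell carcinoma",
    "large cell carcinoma",
    "bronchioalveolar carcinoma",
    "pulmonary adenocarcinoma",
    "NSCLC",
    "non-small cell lung cancer",
]


def build_geo_query(keywords):
    """Quote each phrase and join with OR (hypothetical helper; no field tags added)."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    return " OR ".join(f'"{kw}"' for kw in keywords)


query = build_geo_query(KEYWORDS)
print(query)
```

A string of this form could then be supplied as the search term to an Entrez query against the GEO DataSets database; any restriction to microarray studies or to a date range would need to be added separately.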

For each record, we programmatically saved the study title, dataset summary, and platform information via NCBI’s Entrez Programming Utilities API, using the Bio.Entrez module of the Biopython library [26] in Python 3.10.9 (Python Software Foundation, Wilmington, DE, USA). Sample-level clinical tables were accessed from GEO with the GEO2R tool, saved using the Python library Selenium [27], and merged into a single CSV file per study to ensure completeness.

We then prompted an LLM via the OpenAI API to classify studies using inclusion and exclusion criteria. Initially, we used the GPT-4o mini LLM (version 2024-07-18) before later switching to GPT-4.1 mini (version 2025-04-14), a newer model. This switch was made after GPT-4.1 mini demonstrated superior sensitivity to GPT-4o mini in a set of 536 GEO studies, a subset of the 999 GEO studies we consider in this study restricted to studies with keywords “non-small cell lung cancer” or “NSCLC,” with two independent human reviewers as reference (GPT-4.1 mini: 92%; GPT-4o mini: 75%) [24]. In the subset, GPT-4.1 mini correctly included 11 studies and excluded 505 studies but incorrectly included 19 studies and excluded 1 study, demonstrating 96.4% specificity and 96.3% accuracy [24]. GPT-4.1 mini strictly applied inclusion and exclusion criteria, leading to the false exclusion of one study [24]. The study described cases as “early stage” without specifying explicit stages, which human reviewers chose to include for further review [24]. Additionally, GPT-4.1 mini sometimes confused discussion of survival in study descriptions with presence of patient-specific survival or elected to include studies despite acknowledging lack of patient-specific survival, resulting in incorrect inclusions [24].
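The reported screening metrics follow directly from the four counts given above (11 correct inclusions, 19 incorrect inclusions, 505 correct exclusions, 1 incorrect exclusion). The sketch below recomputes them; the function name is illustrative.

```python
def screening_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "n": total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
    }


# GPT-4.1 mini on the 536-study subset, with human review as reference.
metrics = screening_metrics(tp=11, fp=19, tn=505, fn=1)
for name in ("sensitivity", "specificity", "accuracy"):
    print(f"{name}: {100 * metrics[name]:.1f}%")
```

The counts sum to the 536 studies in the subset, and the rounded values agree with the 92% sensitivity, 96.4% specificity, and 96.3% accuracy quoted in the text.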

Both GPT-4.1 mini and GPT-4o mini were prompted using the same prompt and study data, which were collected and saved to CSV files in January 2025, in a zero-shot setting with temperature set to zero to ensure reproducibility. The prompt required patient-specific overall-survival data and either 50% or more patients with stage I NSCLC (to capture untreated early-stage cohorts) or explicit ACT annotation [24]. Chain-of-thought and expert role-play prompting were used to improve sensitivity [28]. The prompt is available in the GitHub repository found in the Data Availability Statement. Hereafter, “LLM” refers exclusively to GPT-4.1 mini.
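The inclusion logic encoded in the prompt can be written as a small Boolean rule, which may help when auditing LLM decisions against human review. This is a sketch of the rule as described in the text, not the prompt itself; the `StudySummary` fields are hypothetical names for facts a reviewer would extract from a GEO record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StudySummary:
    """Hypothetical per-study facts extracted during screening."""
    has_patient_level_os: bool
    stage_i_fraction: Optional[float]  # None when stage is not reported
    has_explicit_act: bool


def passes_screen(study):
    """Require patient-level OS and (>= 50% stage I or explicit ACT annotation)."""
    if not study.has_patient_level_os:
        return False
    mostly_stage_i = (
        study.stage_i_fraction is not None and study.stage_i_fraction >= 0.5
    )
    return mostly_stage_i or study.has_explicit_act


examples = [
    StudySummary(True, 0.60, False),   # untreated early-stage cohort: include
    StudySummary(True, None, True),    # ACT annotated, stage unreported: include
    StudySummary(False, 0.90, True),   # no patient-level survival: exclude
    StudySummary(True, 0.30, False),   # neither criterion met: exclude
]
decisions = [passes_screen(s) for s in examples]
print(decisions)
```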

Based on GEO titles, descriptions, and data accessed via GEO2R, the LLM excluded 950 records, leaving 49 records for human review. Two human reviewers independently reviewed the title, description, and data of each of these 49 records, and discrepancies were resolved by discussion. Human review excluded 18 records for missing patient-specific overall survival or for both of the following: fewer than 50% of patients with stage I NSCLC and no explicit ACT annotation. For each of the 31 remaining records, reports (GEO metadata and associated full-text publications) were sought to retrieve ACT annotation when explicit, patient-specific annotation was not included in the dataset. One dataset (GSE31547) had neither ACT annotation nor a report and was excluded. A total of 30 reports were then evaluated to exclude datasets with microarray platforms other than GPL570 and GPL96 (18 excluded), missing raw CEL files (1 excluded), unknown ACT administration status (1 excluded), and duplicated GEO datasets not previously excluded (2 excluded). This left 8 cohorts for inclusion in the meta-database, which were associated with 12 publications [9,12,16,29,30,31,32,33,34,35,36,37].

To ensure raw-level harmonization under a single array chemistry, the present release restricts inclusion to Affymetrix GPL570 (U133 Plus 2.0) and GPL96 (U133A) cohorts. At the full-text/metadata stage, 30 reports were evaluated and 18 were excluded precisely for using non-Affymetrix platforms, leaving Affymetrix studies as the only candidates that simultaneously offered explicit ACT status, patient-level survival, and raw CEL files for uniform preprocessing. Within the included cohorts, GPL570 and GPL96 contributed 54,675 and 22,283 probe sets, respectively; after re-annotation to Entrez IDs and gene-level collapsing, restricting to the cross-platform intersection produced a single matrix of 13,039 genes spanning 1340 patients after QC. We note that incorporating additional microarray chemistries or RNA-seq would require cross-technology normalization and a smaller common gene set or model-level fusion; to maintain a transparent benchmark in this initial release, we prioritized a single-chemistry core with demonstrated batch attenuation and will provide a separately versioned, expanded layer as those pipelines mature.

The eight studies met stringent quality criteria (detailed in Section 2.3) and were retained for integration, spanning two Affymetrix whole-genome microarray platforms: GPL570 and GPL96. We focused on Affymetrix because it provided both the largest pool of eligible studies and greater platform consistency, thereby reducing technical variability and enhancing cross-study integration. The eight selected studies are summarized in Table 1.

2.2. Inclusion and Exclusion Criteria

Selected studies featured GEO datasets with primary resected stage I–IIIA NSCLC, known receipt or non-receipt of adjuvant chemotherapy, patient-specific overall survival and censoring information, and available microarray profiling on Affymetrix whole-genome platforms (GPL570 and GPL96). We excluded studies involving patients who received neoadjuvant therapy, non-chemotherapy adjuvant therapy, or next-line systemic chemotherapy (i.e., chemotherapy not administered in the adjuvant setting due to chemorefractory or recurrent disease), as well as studies of small cell lung cancer histology. Studies with RNA obtained from cell lines and xenograft-derived datasets were also excluded.

2.3. Preprocessing of Metadata

2.3.1. Clinical Data Preparation

To ensure consistent cross-study analysis, we first identified a core set of clinical factors that appeared in at least 40% of all datasets. These were sex, age, tumor histology, smoking history, pathological stage, and race/ethnicity. In addition, ACT status and overall-survival data were mandatory for inclusion in the analytic cohort.

The original clinical annotations showed substantial variation in terminology, coding, and structure, requiring a harmonization step before integration. Categorical variables, including stage, histology, sex, race/ethnicity, and smoking history, were recoded to a unified set of categories (e.g., American Joint Committee on Cancer (AJCC) for stage; histology as adenocarcinoma, squamous cell carcinoma, large cell carcinoma, adenosquamous carcinoma; smoking history as yes, no, or unknown). Continuous variables such as age and survival time were reformatted to use consistent units and examined for irregular values (e.g., non-positive survival times or unit mismatches). Missing values in categorical fields were coded as Unknown, while missing age values were imputed using a random forest approach to preserve sample size.
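A minimal sketch of this kind of categorical recoding is shown below. The target categories come from the text, but the raw label variants in the lookup tables (such as "adc" or "former") are hypothetical examples, since the study-specific spellings are not listed here.

```python
# Tokens treated as missing (assumed; actual source spellings may differ).
MISSING = {"", "na", "n/a", "unknown", "not available"}

# Hypothetical raw variants mapped to the unified categories named in the text.
HISTOLOGY = {
    "adenocarcinoma": "adenocarcinoma",
    "adc": "adenocarcinoma",
    "squamous cell carcinoma": "squamous cell carcinoma",
    "scc": "squamous cell carcinoma",
    "large cell carcinoma": "large cell carcinoma",
    "lcc": "large cell carcinoma",
    "adenosquamous carcinoma": "adenosquamous carcinoma",
}
SMOKING = {
    "yes": "yes", "ever": "yes", "current": "yes", "former": "yes",
    "no": "no", "never": "no",
}


def recode(raw, mapping):
    """Map a raw label to its unified category; missing values become 'Unknown'."""
    key = "" if raw is None else str(raw).strip().lower()
    if key in MISSING:
        return "Unknown"
    if key not in mapping:
        # Fail loudly so unseen variants are reviewed instead of silently dropped.
        raise ValueError(f"unrecognized label: {raw!r}")
    return mapping[key]


print(recode(" ADC ", HISTOLOGY), recode("Former", SMOKING), recode(None, SMOKING))
```

Raising on unrecognized labels, instead of coding them as Unknown, is a design choice of this sketch: it keeps genuinely missing data separate from spellings that still need a mapping.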

Records missing any of the critical variables—ACT status, survival time, or survival outcome—were excluded. One stage IV patient originating from GSE37745 was removed as the treatment intent was outside the adjuvant setting. After these steps, we obtained a clean, harmonized dataset of 1347 patients with standardized labels and definitions across all studies.

2.3.2. Preprocessing of Gene-Expression Data

We processed all raw Affymetrix CEL files (GPL96: U133A; GPL570: U133 Plus 2.0) in a uniform pipeline designed to produce a single, cross-study gene-level matrix suitable for integrative analyses. Series were retrieved with the Bioconductor package “GEOquery” in R 4.5.1 (R Foundation for Statistical Computing, Vienna, Austria), and CEL files were imported and normalized per platform to preserve calibration integrity. Raw CEL files were processed using the Robust Multi-array Average (RMA) algorithm [18,38], implemented in the R “affy” package. The RMA method consists of three sequential steps: background correction of probe intensities, quantile normalization, and robust probe-set summarization to the log₂ scale. Processing at the platform level ensured that identical chemistry and probe design were treated consistently before cross-study mapping.
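To illustrate the second RMA step only, the sketch below applies quantile normalization to a toy set of arrays. It is not a substitute for the “affy” implementation: background correction and probe-set summarization are omitted, ties are broken by position, and the toy intensities are invented.

```python
def quantile_normalize(columns):
    """Force every array (column) to share the same empirical distribution.

    columns: list of equal-length lists, one list of intensities per array.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all arrays must contain the same number of probes")
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)
    ]
    normalized = []
    for col in columns:
        # Probe indices ordered from smallest to largest intensity.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


# Three toy arrays measured on four probes (invented values).
arrays = [[5, 2, 3, 4], [4, 1, 4, 2], [3, 4, 6, 8]]
normalized = quantile_normalize(arrays)
for col in normalized:
    print([round(v, 3) for v in col])
```

After normalization each array contains exactly the same set of values, while the rank order of probes within each array is preserved.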

After normalization, we re-annotated probe sets to current Entrez Gene identifiers (using up-to-date annotation files) and collapsed multiple probes mapping to the same gene by the median expression value. This simple, robust rule has been shown to perform consistently well in survival modeling [39]. To ensure comparability across platforms, we restricted the matrix to the intersection of genes shared by GPL96 and GPL570, yielding 13,039 shared genes. The resulting gene-by-sample matrix was then linked to the harmonized clinical schema described in Section 2.3.1.
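The collapse-then-intersect step can be sketched as follows. The probe IDs, gene IDs, and expression values are placeholders invented for the example; real mappings would come from the platform annotation files.

```python
from collections import defaultdict
from statistics import median


def collapse_to_genes(probe_values, probe_to_gene):
    """Collapse probe-level rows to gene level by the per-sample median.

    probe_values: {probe_id: [expression per sample]}
    probe_to_gene: {probe_id: gene_id}; unmapped probes are dropped.
    """
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            grouped[gene].append(values)
    return {
        gene: [median(sample_vals) for sample_vals in zip(*rows)]
        for gene, rows in grouped.items()
    }


def shared_genes(*gene_matrices):
    """Genes present on every platform, in sorted order."""
    common = set(gene_matrices[0])
    for matrix in gene_matrices[1:]:
        common &= set(matrix)
    return sorted(common)


# Placeholder data: platform A has two samples, platform B has three.
platform_a = {"pA_1": [1.0, 2.0], "pA_2": [3.0, 4.0], "pA_3": [5.0, 10.0],
              "pA_4": [7.0, 7.0], "pA_5": [0.0, 0.0]}
map_a = {"pA_1": 101, "pA_2": 101, "pA_3": 101, "pA_4": 202}  # pA_5 unmapped
platform_b = {"pB_1": [2.0, 2.5, 3.0], "pB_2": [9.0, 9.0, 9.0]}
map_b = {"pB_1": 101, "pB_2": 303}

genes_a = collapse_to_genes(platform_a, map_a)
genes_b = collapse_to_genes(platform_b, map_b)
common = shared_genes(genes_a, genes_b)
print(genes_a[101], common)
```

In this toy case gene 101 is measured by three probes on platform A and collapses to their per-sample median, and it is the only gene retained because it alone appears on both platforms.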

Prior to batch correction, principal-component analysis (PCA) [40] revealed pronounced study and platform gradients. We therefore applied ComBat with a covariate-preserving design matrix to adjust for study/platform effects while retaining biological variation associated with key clinical factors (i.e., stage, histology, smoking history, race, sex, and age). This substantially attenuated technical structure and improved cross-study harmonization in low-dimensional projections, as reported in the Results.

Together, these steps—from raw-level RMA through re-annotation, probe collapsing, cross-platform gene filtering, and covariate-aware batch correction—produced a single harmonized expression matrix for 1347 patients and 13,039 shared genes. This standardized resource forms a stable foundation for downstream analyses, including survival modeling and the evaluation of ACT-benefit biomarkers.

2.3.3. Batch Effect Assessment, Correction, and Quality Control

To ensure data integrity across studies, we first assessed platform- and study-specific batch effects using PCA and Uniform Manifold Approximation and Projection (UMAP) [41]. These complementary techniques enable us to detect and visualize non-biological variation before applying correction methods. PCA was implemented using R’s “prcomp (center = T, scale = T)” function, and we recorded the percentage of total variance explained by the first (PC1) and second (PC2) principal components. UMAP was run via the “uwot” R package (n_neighbors = 15, min_dist = 0.1) on the 13,039 genes.

To quantify residual batch structure, we computed the average silhouette width [42] before and after correction, which measures how well samples are grouped. The silhouette width is calculated for each sample i and is defined as

S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},

where a(i) is the average distance from sample i to all other samples within its own batch, and b(i) is the smallest average distance from sample i to all samples in the nearest different batch.

A high average silhouette width (close to 1) means samples within batches are tightly grouped and clearly distinct from other groups, indicating strong batch effects and distinct clusters. On the other hand, a low silhouette width (close to 0) indicates overlapping or mixed groups, suggesting minimal residual batch effects and effective batch correction. A negative silhouette width indicates that samples are not clustered correctly or that batches are mixed.
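The behavior described above can be reproduced on toy data. The sketch below computes the mean silhouette width with batch labels as the grouping, using Euclidean distance on invented two-dimensional points; the distance metric and the space (full expression matrix versus a reduced projection) used in the study are not specified here, so both are assumptions.

```python
from math import dist


def mean_silhouette(points, labels):
    """Average silhouette width S(i) = (b - a) / max(a, b) over all samples."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    if len(groups) < 2:
        raise ValueError("need at least two batches")
    widths = []
    for p, lab in zip(points, labels):
        own = groups[lab]
        if len(own) == 1:
            widths.append(0.0)  # common convention for singleton groups
            continue
        # dist(p, p) is 0, so summing over the whole group excludes self.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in groups.items()
            if other_lab != lab
        )
        denom = max(a, b)
        widths.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(widths) / len(widths)


# Batches far apart: strong batch effect.
separated = mean_silhouette([(0, 0), (0, 1), (10, 0), (10, 1)], ["A", "A", "B", "B"])
# Same points, but each batch straddles both locations: batches are mixed.
mixed = mean_silhouette([(0, 0), (10, 0), (0, 1), (10, 1)], ["A", "A", "B", "B"])
print(round(separated, 2), round(mixed, 2))
```

With well-separated batches the average is close to 1, whereas the mixed configuration gives a negative value, mirroring the direction of the change expected after successful batch correction.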

Building on these exploratory assessments, batch correction was implemented using the empirical Bayes ComBat algorithm to remove unwanted variation while preserving biological signals. Batch factors included Affymetrix microarray platforms and the originating study cohort. A model matrix was specified to retain key clinical covariates (i.e., stage, histology, smoking history, race, sex, and age) during correction to maintain essential biological variability. ComBat models each gene’s expression Y_{ij} in sample j as

Y_{ij} = α_{i} + X_{j} β_{i} + γ_{i,b(j)} + δ_{i,b(j)} ϵ_{ij},

where α_{i} is the overall mean for gene i, and X_{j} is the design matrix of key clinical covariates with gene-specific effects β_{i}. The effects γ_{i,b(j)} and δ_{i,b(j)} are the additive and multiplicative batch effects for batch b(j), and ϵ_{ij} is the residual error. This method pools information across all genes to estimate the batch-effect parameters more accurately, making the correction more reliable. The algorithm is implemented in the R function “ComBat” in the “sva” package.
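To make the roles of the additive and multiplicative terms concrete, the sketch below applies a deliberately simplified location-and-scale adjustment to a single gene. It is not ComBat: it omits the covariate term X_{j} β_{i} and the empirical Bayes shrinkage across genes, and simply removes each batch's plug-in mean shift and rescales its spread to the pooled value. Function and variable names are illustrative, and the expression values are invented.

```python
from statistics import mean, stdev


def location_scale_adjust(values, batches):
    """Per-gene adjustment: Y* = alpha + (Y - alpha - gamma_b) / delta_b.

    gamma_b is the batch mean shift and delta_b the batch sd relative to the
    pooled within-batch sd. No covariates and no empirical Bayes shrinkage.
    """
    by_batch = {}
    for v, b in zip(values, batches):
        by_batch.setdefault(b, []).append(v)
    if any(len(vs) < 2 for vs in by_batch.values()):
        raise ValueError("each batch needs at least two samples")
    alpha = mean(values)
    batch_sd = {b: stdev(vs) for b, vs in by_batch.items()}
    if any(sd == 0 for sd in batch_sd.values()):
        raise ValueError("a batch has zero variance for this gene")
    dof = len(values) - len(by_batch)
    pooled_sd = (
        sum((len(vs) - 1) * batch_sd[b] ** 2 for b, vs in by_batch.items()) / dof
    ) ** 0.5
    gamma = {b: mean(vs) - alpha for b, vs in by_batch.items()}   # additive
    delta = {b: batch_sd[b] / pooled_sd for b in by_batch}        # multiplicative
    return [alpha + (v - alpha - gamma[b]) / delta[b] for v, b in zip(values, batches)]


# One gene measured in two batches with different mean and spread (toy values).
expr = [1.0, 2.0, 3.0, 11.0, 13.0, 15.0]
batch = ["A", "A", "A", "B", "B", "B"]
adjusted = location_scale_adjust(expr, batch)
print([round(v, 3) for v in adjusted])
```

After adjustment both batches share the same mean and standard deviation for this gene. In the full algorithm the batch parameters are shrunk toward values pooled across genes, and covariate effects are protected by the design matrix, which this sketch does not attempt.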

After batch correction, we conducted comprehensive array-level quality control to identify and exclude microarray samples with potential instrumental artifacts. We implemented a robust, median-based quality-control filter, the median M-score deviation method [43], to detect arrays with abnormal overall signal intensities. We computed each array’s M-score as the median log₂-transformed expression value across all 13,039 genes. Across the cohort, we summarized the distribution of these M-scores by their interquartile range (IQR) around the cohort-wide median. Arrays deviating by more than 2-fold above or below that cohort median were flagged and subject to removal. This filter detects arrays with unusually high or low overall signal intensities, which often arise from uneven hybridization or scanner malfunctions. Because the median is insensitive to a small number of outlier probes, this filter robustly captures global array-level artifacts without being overly influenced by a few aberrant spots.
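The 2-fold rule can be sketched as a simple filter on precomputed M-scores. The array identifiers and M-score values below are invented, and the sketch takes the M-scores as given instead of deriving them from an expression matrix; it also assumes a positive cohort median, as the fold rule requires.

```python
from statistics import median


def flag_by_m_score(m_scores, fold=2.0):
    """Flag arrays whose M-score is more than `fold` above or below the cohort median.

    m_scores: {array_id: M-score}. Returns (flagged_ids, lower_bound, upper_bound).
    """
    centre = median(m_scores.values())
    if centre <= 0:
        raise ValueError("fold-based bounds assume a positive cohort median")
    lower, upper = centre / fold, centre * fold
    flagged = sorted(a for a, m in m_scores.items() if m > upper or m < lower)
    return flagged, lower, upper


# Hypothetical arrays; the median of these toy scores is 0.048.
toy_scores = {
    "array_1": 0.040,
    "array_2": 0.048,
    "array_3": 0.050,
    "array_4": 0.120,
    "array_5": 0.045,
}
flagged, lower, upper = flag_by_m_score(toy_scores)
print(flagged, round(lower, 3), round(upper, 3))
```

With a cohort median of 0.048 the bounds are 0.024 and 0.096, so only the array with an M-score of 0.120 is flagged in this toy example.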

To identify outlier microarrays exerting disproportionate influence on the principal component (PC) model, we computed leverage values from the first two principal components of the normalized expression matrix. Leverages were derived from the diagonal of the hat matrix

H = U (U^{T} U)^{-1} U^{T},

where U is the matrix of PC scores. We defined the leverage upper threshold as the 99.5th percentile of the leverage distribution.
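For two retained components, the diagonal of H can be computed without forming the full matrix. The sketch below does this for toy PC scores and applies a percentile cutoff; the scores are invented, and the linear-interpolation percentile used here is one common definition that may differ from the one used in the study.

```python
def pc_leverages(scores):
    """Diagonal of H = U (U^T U)^{-1} U^T for a two-column score matrix U."""
    s11 = sum(x * x for x, _ in scores)
    s12 = sum(x * y for x, y in scores)
    s22 = sum(y * y for _, y in scores)
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("U^T U is singular")
    # Entries of the symmetric 2x2 inverse of U^T U.
    i11, i12, i22 = s22 / det, -s12 / det, s11 / det
    return [x * x * i11 + 2 * x * y * i12 + y * y * i22 for x, y in scores]


def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


# Toy (PC1, PC2) scores; the fifth sample sits far out along PC1.
toy_scores = [(1.0, 0.5), (-1.0, 0.3), (0.5, -1.0), (-0.5, -0.8), (8.0, 1.0), (0.0, 1.0)]
leverages = pc_leverages(toy_scores)
cutoff = percentile(leverages, 99.5)
high_leverage = [i for i, h in enumerate(leverages) if h > cutoff]
print([round(h, 3) for h in leverages], high_leverage)
```

Because the trace of a hat matrix equals the number of columns of U, the leverages sum to 2 and their mean is 2/n. For n = 1347 arrays this gives about 0.00148, which is a useful sanity check on any reported leverage summary.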

2.4. Graphical Summary

Figure 2 provides a consolidated overview of our entire workflow, from systematic data acquisition and LLM-assisted study selection through multi-platform preprocessing, batch-effect assessment, correction, and quality control. Starting with a PRISMA-style search of GEO, we utilized a zero-shot GPT-4.1 mini pipeline for rapid identification and screening of NSCLC microarray cohorts. Raw Affymetrix CEL files were normalized by RMA, probes were re-annotated to Entrez IDs, and duplicates were collapsed to produce a unified matrix of 13,039 genes. Batch effects were visualized via PCA/UMAP, study/platform effects were corrected using covariate-preserving ComBat, and array-level outliers were removed using median M-score deviation and PCA leverage. The final harmonized dataset supports robust biomarker discovery efforts in early-stage NSCLC.

3. Results

Figure 3 visualizes the pre-harmonization structure and illustrates the need for correction. PCA and UMAP revealed distinct sample clusters that are primarily separated by microarray platform and by study of origin (i.e., batch effects). These clusters reflect technical heterogeneity rather than biological variation, confirming the need for formal batch correction.

After applying ComBat with a covariate-preserving design, platform/study-driven structure collapsed into a single, intermixed cloud, showing improved integration of studies (Figure 4, PCA; Figure 5, UMAP). Comparing Figure 3 (before batch correction) and Figure 4 (after batch correction), we note the substantial reduction in the variance explained. In Figure 3, PC1 explains 38.7% of the total variance, and PC2 explains 24.4% of the total variance. Together, these two components account for 63.1% of the total variability in our dataset. This substantial variance explained by just two principal components (63.1%) indicates very strong batch-related variation prior to correction. The clearly separated clusters reflect substantial differences among batches (e.g., different platforms or study cohorts).

After batch correction (Figure 4), the variance explained drops substantially (PC1: 12.4%, PC2: 7.7%, total 20.1%), indicating effective removal of batch-driven variability and preservation of biological variation. This reduction in explained variance from 63.1% to 20.1% confirms that the batch correction procedure by ComBat was highly effective in minimizing unwanted instrumental differences between batches while maintaining the clinical genomic signals necessary for downstream modeling.

Following ComBat, Figure 5 also presents a single, diffuse cloud in which microarrays intermingle extensively. The disappearance of discrete clusters confirms that ComBat has effectively removed platform- and study-specific technical effects.

This qualitative harmonization aligns with quantitative metrics: the average silhouette width computed on batch labels falls from 0.82 before correction to −0.19 after correction, confirming that samples no longer clustered tightly by batch and instead exhibited substantial overlap across studies. Together, these analyses provide compelling evidence that batch correction has successfully attenuated unwanted technical variation, thereby unmasking the true biological diversity within the integrated NSCLC transcriptomic cohort.

As shown in Figure 6, the distribution of M-scores across all 1347 patients (arrays) is unimodal and right-skewed, with many samples falling between 0.05 and 0.10. The cohort-wide median was 0.048. Thus, any array whose M-score is greater than 2 × 0.048 = 0.096 or less than 0.5 × 0.048 = 0.024 is subject to removal. Only four microarrays (GSM1213786, GSM1213845, GSM1672568, and GSM370984) exceeded the upper threshold, and none fell below the lower bound (see ‘×’ marks in Figure 4), indicating a generally high level of array quality and minimal residual batch effects. These results support the robustness of the batch correction procedure and the effectiveness of our quality control pipeline. These findings also confirm that the final dataset maintained technical consistency suitable for downstream analysis.

Complementing this, PCA leverage analysis (Figure 7) exhibited a right-skewed leverage distribution (mean 0.00148; standard deviation 0.00146; median 0.00115; IQR 0.00156). An extreme-outlier cutoff at the 99.5th percentile (leverage = 0.00829), which lies approximately 4.7 standard deviations above the mean given the summary statistics above, flagged seven arrays in total as high-leverage outliers: the four previously recognized M-score outliers (GSM1213786, GSM1213845, GSM1672568, GSM370984) plus three additional arrays (GSM1672550, GSM1672576, GSM1672587). All seven arrays were removed from the meta-database. These results confirmed a consistent set of outliers across complementary QC metrics, reinforcing our exclusion decisions and ensuring a robust dataset for downstream analyses. Together with clinical-inclusion filtering, these steps yielded the 1340-patient analytic cohort and a harmonized gene-level expression matrix spanning 13,039 intersecting genes.

The clinical and demographic profile of the final cohort is summarized in Table 2. The final meta-database comprises 1340 patients with resected early-stage NSCLC. The cohort’s median and mean age are both 65 years (range, 30–89); 44.9% are female (602/1340) and 55.1% male (738/1340). Disease stage is predominantly early: IA (407, 30.4%) and IB (462, 34.5%) together account for 64.9% of the cohort, followed by stage II (343, 25.6%) and III (128, 9.5%). Histology is mainly adenocarcinoma (923, 68.9%) and squamous cell carcinoma (384, 28.7%), with large cell (31, 2.3%) and adenosquamous (2, 0.1%) subtypes infrequent. Smoking history is documented as yes in 588 (43.9%) and no in 182 (13.6%), with 570 (42.5%) unknown; among those with known status (n = 770), 76.4% report a smoking history. Race/ethnicity is largely missing (unknown in 964 of 1340, 71.9%); among records with known race (n = 376), Caucasian accounts for 94.1% (354/376), African American 3.7% (14/376), Asian 1.9% (7/376), and Native Hawaiian 0.3% (1/376). Overall, 223 patients (16.6%) received adjuvant chemotherapy (ACT) and 1117 (83.4%) were managed with observation.

With total events D = 594 and ACT allocation proportion p = 0.166, the standard log-rank events formula with our group allocation,

D \cdot p(1 - p) \cdot (\ln HR)^{2} \approx (z_{1 - \frac{α}{2}} + z_{1 - β})^{2},

implies

HR \approx \exp\left\{-\sqrt{\frac{(1.96 + 0.84)^{2}}{D \cdot p(1 - p)}}\right\} \approx 0.74

at 80% power (two-sided α = 0.05). We report this as a design-level constraint and align our aims accordingly (heterogeneity-of-treatment-effect (HTE) discovery and benchmarking rather than re-estimating the average ACT effect).
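The detectable hazard ratio can be recomputed by solving the events formula for HR, which places D · p(1 − p) under the square root in the denominator. The sketch below does so with normal quantiles from the Python standard library; the function name is illustrative.

```python
from math import exp, sqrt
from statistics import NormalDist


def detectable_hr(events, p_treated, alpha=0.05, power=0.80):
    """Smallest protective hazard ratio detectable by a two-sided log-rank test.

    Solves D * p * (1 - p) * (ln HR)^2 = (z_{1-alpha/2} + z_{power})^2 for HR < 1.
    """
    if not 0 < p_treated < 1:
        raise ValueError("allocation proportion must lie strictly between 0 and 1")
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return exp(-z_total / sqrt(events * p_treated * (1 - p_treated)))


hr = detectable_hr(events=594, p_treated=0.166)
print(round(hr, 3))
```

With D = 594 and p = 0.166 this evaluates to roughly 0.73, in line with the value of about 0.74 reported in the text. The unbalanced allocation matters: the same number of events with p = 0.5 would allow a hazard ratio closer to 1 to be detected.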

4. Discussion

Our work delivers, to our knowledge, the first open-access, harmonized transcriptomic meta-database that links gene-expression profiles with adjuvant platinum-based chemotherapy (ACT) exposure and overall survival in early-stage NSCLC. Eight Affymetrix microarray cohorts (GPL570 and GPL96) were integrated, yielding 1347 patients and 13,039 genes after uniform RMA normalization, probe re-annotation, and covariate-preserving ComBat correction, before array-level QC. PCA showed that the share of variance captured by PC1 and PC2, which was dominated by platform and study origin, fell from over 60% to around 20% post-correction, and UMAP projections showed the same loss of batch clusters. The average silhouette width shifted from 0.82 to −0.19, demonstrating successful batch correction.

Our quality control framework, comprising median M-score deviation filtering and PCA leverage outlier removal, removed seven arrays (0.5%) so that downstream analyses use only high-quality arrays. The final analytic cohort includes 1340 patients from eight Affymetrix cohorts, with 13,039 intersecting genes and 594 overall-survival events. The final dataset’s technical consistency supports its use for robust predictive modeling and biomarker discovery.
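As an illustration of the leverage filter, the sketch below scores each array by its leverage in the space of the leading principal components and flags those above the 99.5th percentile, the cutoff named in the Figure 7 caption. Everything else is an assumption made for the example: the number of retained components (`n_components=10`), the leverage definition (diagonal of the projection onto the leading left singular vectors), the function names, and the simulated data. It requires NumPy.

```python
import numpy as np


def pca_leverage(expr: np.ndarray, n_components: int = 10) -> np.ndarray:
    """Leverage of each sample (row) in the leading principal-component space."""
    centered = expr - expr.mean(axis=0, keepdims=True)
    # Thin SVD: rows of u correspond to samples, columns are orthonormal.
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    # Diagonal of U_k U_k^T, i.e. the row-wise sum of squared entries.
    return np.sum(u[:, :n_components] ** 2, axis=1)


def flag_high_leverage(leverage: np.ndarray, percentile: float = 99.5) -> np.ndarray:
    """Indices of samples whose leverage exceeds the given percentile."""
    cutoff = np.percentile(leverage, percentile)
    return np.flatnonzero(leverage > cutoff)


# Toy data: 200 samples x 50 genes, with sample 7 made artificially extreme.
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 50))
toy[7] *= 20.0
flagged = flag_high_leverage(pca_leverage(toy))
print(flagged)
```

A percentile cutoff always flags a fixed fraction of arrays, so in practice it is worth inspecting the flagged samples rather than treating the threshold as a test of abnormality.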

From 999 GEO records screened with an LLM-assisted PRISMA workflow, only 8 cohorts satisfied our inclusion/exclusion criteria: patient-level overall survival, explicit ACT status, and raw Affymetrix CEL files to ensure rigorous harmonization and covariate-preserving ComBat correction. This maximizes internal validity and technical comparability for a first open benchmark but necessarily limits the number of included studies and the size of the ACT-treated subgroup. Nevertheless, the integrated resource comprises 1340 patients (ACT n = 223; OBS n = 1117) with 594 OS events across 13,039 intersecting genes and exhibits substantial attenuation of batch structure (PC1 + PC2 63.1% to 20.1%; silhouette width 0.82 to −0.19) after correction. Using the standard log-rank events formula with our group allocation, those 594 events provide approximately 80% power (two-sided α = 0.05) to detect an overall ACT hazard ratio of about 0.74 or stronger. While the database is not primarily designed to re-estimate the modest average ACT effect, it provides the event depth and multi-cohort structure needed for HTE discovery with internal–external validation.

Strengths, Limitations, and Future Directions

By addressing the primary reasons prior ACT-biomarker studies have failed to translate (inconsistent treatment metadata, platform and preprocessing heterogeneity, and uncorrected batch effects), this resource provides a stable foundation on which the community can build and fairly evaluate models intended to personalize adjuvant therapy. The resource offers three principal advantages. First, it provides enhanced statistical power and diversity. Integrating eight publicly available cohorts expands event counts for survival modeling and captures geographic and demographic heterogeneity, improving the generalizability of downstream signatures. Second, it provides ready-to-use and reproducible data objects. All expression data and harmonized clinical factors (sex, age, histology, stage, smoking history, race) are provided in a unified format, lowering the barrier to secondary analyses and external validation. Third, it provides a scalable framework for future multi-omics integration. The modular workflow (LLM-assisted selection, standardized preprocessing, empirical-Bayes correction, robust quality control) is readily extensible to RNA-seq or other platforms as additional datasets emerge.

Several constraints merit emphasis. Reliance on Affymetrix microarrays to maximize cross-study comparability restricts immediate interoperability with modern sequencing data. Because this is a retrospective meta-analysis, residual confounding cannot be excluded despite covariate harmonization and adjustment, and prospective validation remains essential. Additionally, while LLM-assisted screening streamlines curation, expert review remains important to ensure accurate screening decisions.

In the integrated public cohorts, ACT exposure is available as a binary indicator (received vs. not received) without consistent regimen-level detail (platinum agent, partner drug, cycles, timing, or dose intensity). This constrains modeling in several ways: (i) dilution of effects, because heterogeneous regimens are aggregated into one exposure class, can bias average treatment effects and treatment–biomarker interactions toward the null; (ii) unmodeled effect modification, if biomarker benefit differs by agent or dose, may be obscured; (iii) reduced transportability, when regimen mix varies by cohort or geography; and (iv) fairness assessment limits, if regimen allocation correlates with clinical or demographic factors. Accordingly, the present resource is best suited for benchmarking “ACT vs. observation” decision support and HTE discovery at the level of adjuvant chemotherapy as a class, rather than agent-specific recommendations.

Until regimen details are enriched, users can (a) adjust for preserved clinical covariates (stage, histology, smoking, race/ethnicity, sex, age), (b) perform leave-one-cohort-out validation to test transportability across different regimen mixes, and (c) use hierarchical/partial-pooling models that include cohort-level random effects, which absorb unmeasured, cohort-specific regimen patterns. These measures reduce, but do not eliminate, bias from aggregated ACT exposure—hence our emphasis on prospective validation before clinical deployment.
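The leave-one-cohort-out scheme in (b) can be sketched as a small generator that holds out one study at a time. This is an illustrative helper, not code from the released repository; the function name is hypothetical and the toy cohort labels simply reuse GEO series identifiers from Table 1.

```python
from collections import defaultdict


def leave_one_cohort_out(cohort_ids):
    """Yield (held_out_cohort, train_indices, test_indices) for each cohort."""
    by_cohort = defaultdict(list)
    for idx, cohort in enumerate(cohort_ids):
        by_cohort[cohort].append(idx)
    for held_out in sorted(by_cohort):
        test_idx = by_cohort[held_out]
        train_idx = [i for cohort, idxs in by_cohort.items()
                     if cohort != held_out for i in idxs]
        yield held_out, train_idx, test_idx


# Toy example: one cohort label per patient (labels only, no real data).
toy_cohorts = ["GSE29013", "GSE29013", "GSE37745",
               "GSE68465", "GSE68465", "GSE14814"]
for held_out, train_idx, test_idx in leave_one_cohort_out(toy_cohorts):
    print(held_out, "train:", train_idx, "test:", test_idx)
```

One practical caveat follows from Table 1: three of the eight cohorts contain no ACT-treated patients, so when one of them is held out, treatment-interaction metrics cannot be estimated on the test fold and only prognostic performance can be assessed.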

This resource provides a stable foundation for systematic and transparent benchmarking of classical and deep-learning survival models, identification of robust ACT-benefit signatures in independent clinical trials, method development in treatment-effect modeling, and integration of multi-omics data. Regarding impact and use-cases, this resource is designed to be used, not merely described. By uniting gene expression with curated ACT exposure and harmonized survival across eight Affymetrix cohorts (n = 1340; 13,039 genes; 594 events) and by attenuating batch structure with covariate-preserving ComBat, it enables analyses that require both technical consistency and sufficient events for robust validation.

To understand its purpose, we consider who will use it and how it will be used. In the context of reproducible benchmarking of ACT signatures, investigators can reevaluate published ACT-benefit and prognostic signatures under a uniform pipeline and internal–external (leave-one-cohort-out) validation, reporting calibration, discrimination, and treatment-interaction (ACT × signature) effects across cohorts. This addresses the historical reproducibility gap driven by preprocessing and metadata inconsistency. For HTE modeling, method groups can train penalized survival models with treatment-interaction terms or uplift/survival-causal learners to estimate individual treatment effects (e.g., predicted absolute risk reduction from ACT), using cohort-wise validation to assess transportability. The event depth (594 OS events) and preserved covariates (stage, histology, smoking, race/ethnicity, sex, age) support confounding adjustment and subgroup analyses.
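One common way to write the treatment-interaction model mentioned above is a Cox proportional-hazards model with an ACT main effect and ACT-by-feature interaction terms. The notation below is introduced here for illustration and is not taken from the original analysis: A is the binary ACT indicator, x the vector of clinical covariates and gene-expression features, θ the ACT main effect, β the prognostic coefficients, γ the interaction (predictive) coefficients, and τ a fixed time horizon. Penalization (for example, lasso or elastic net) would typically be applied to β and γ.

```latex
h(t \mid x, A) = h_0(t) \exp\left\{ \theta A + \beta^{\top} x + \gamma^{\top} (A \cdot x) \right\}

\widehat{\mathrm{ARR}}(x; \tau) = \hat{S}(\tau \mid x, A = 1) - \hat{S}(\tau \mid x, A = 0)
```

Under this formulation, a feature with a nonzero entry in γ modifies the ACT effect, and the predicted absolute risk reduction for a patient is the difference between the model's survival probabilities at τ with and without ACT.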

This resource can be used for target–trial emulation and weighting analysis. The patient-level ACT indicator and survival enable propensity-weighted comparisons and sensitivity analyses to explore real-world effectiveness signals in preparation for prospective testing. For transportability and fairness audits, because clinical covariates are harmonized, users can quantify model stability across stage, histology, sex, age, and smoking history (and race/ethnicity where available), identifying domains where recalibration or enrichment is required before translation. Finally, the integrated cohort supports trial planning and simulation of enrichment strategies (e.g., randomizing only high-predicted-benefit patients) and sample-size/power calculations anchored to observed event rates and effect sizes from internal–external validation, informing prospective ACT-biomarker trials.
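For the propensity-weighted comparisons mentioned above, a typical first step is to convert estimated propensity scores into stabilized inverse-probability-of-treatment weights. The sketch below assumes the propensity scores have already been estimated by some model of ACT receipt given the preserved covariates; the function name and toy values are hypothetical.

```python
def stabilized_iptw(treated, propensity):
    """Stabilized inverse-probability-of-treatment weights.

    treated:    iterable of 0/1 ACT indicators
    propensity: estimated P(ACT = 1 | covariates) for each patient
    """
    treated = list(treated)
    propensity = list(propensity)
    if len(treated) != len(propensity):
        raise ValueError("treated and propensity must have the same length")
    p_treat = sum(treated) / len(treated)  # marginal probability of ACT
    weights = []
    for a, e in zip(treated, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(p_treat / e if a else (1.0 - p_treat) / (1.0 - e))
    return weights


# Toy values only; not derived from the meta-database.
toy_treated = [1, 0, 0, 0, 1, 0]
toy_propensity = [0.50, 0.20, 0.10, 0.30, 0.25, 0.15]
toy_weights = stabilized_iptw(toy_treated, toy_propensity)
print([round(w, 3) for w in toy_weights])
```

With roughly one in six patients treated, extreme weights are a realistic concern, so weight distributions and covariate balance after weighting should be checked before interpreting any weighted survival comparison.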

Regarding scope and path to translation, this meta-database is intended for method development, benchmarking, and hypothesis generation. It does not replace randomized evidence; rather, it provides a standardized testbed to (i) prioritize signatures/models, (ii) define decision thresholds (e.g., via decision-curve analysis), and (iii) specify inclusion criteria for prospective validation. By lowering the barrier to robust, multi-cohort evaluation, the resource clarifies which approaches merit clinical testing and accelerates fair comparisons across methods.
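Decision-curve analysis, mentioned in (ii), compares models by their net benefit across threshold probabilities. The sketch below shows the net-benefit calculation for a binary outcome at a fixed time horizon, a simplification that ignores censoring; survival-specific versions replace the raw counts with Kaplan-Meier-based estimates. The function name and the toy values are illustrative only.

```python
def net_benefit(outcomes, risks, threshold):
    """Net benefit of acting on patients whose predicted risk >= threshold.

    outcomes:  0/1 event indicators at a fixed horizon (censoring ignored)
    risks:     predicted event probabilities
    threshold: threshold probability, strictly between 0 and 1
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    pairs = list(zip(outcomes, risks))
    n = len(pairs)
    true_pos = sum(1 for y, r in pairs if r >= threshold and y == 1)
    false_pos = sum(1 for y, r in pairs if r >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


toy_outcomes = [1, 0, 1, 0, 0]
toy_risks = [0.9, 0.2, 0.6, 0.7, 0.1]
nb_model = net_benefit(toy_outcomes, toy_risks, threshold=0.5)
nb_treat_all = net_benefit(toy_outcomes, [1.0] * 5, threshold=0.5)
print(nb_model, nb_treat_all)
```

A model is useful at a given threshold only if its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero.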

In future work, we are extending the workflow to incorporate cross-platform (Affymetrix/Illumina/Agilent) microarrays through gene-level intersection and cross-platform normalization/merging, maintained as an expanded analysis layer to test transportability. Furthermore, we will harmonize RNA-seq studies under a parallel pipeline (e.g., voom/ComBat-Seq), using late-fusion or model-stacking strategies to integrate predictions rather than forcing early feature-level fusion. In parallel, we will continue LLM-assisted extraction of ACT metadata from GEO supplements and associated publications and will contact corresponding authors where ACT fields are incomplete to enrich regimen details. This pragmatic roadmap preserves a reproducible Core dataset for benchmarking while enabling broader coverage for sensitivity analyses and external generalization studies. Prospective testing of emergent signatures in ongoing ACT trials remains the critical step toward clinical translation.

For planning regimen enrichment and partnerships, we will expand the meta-database with regimen-level fields via three channels in a future study: (i) LLM-assisted parsing of full texts and supplements linked to each GEO study to capture platinum agent, partner, cycles, schedule (e.g., q3-week), start window post-resection, and available dose or relative dose intensity; (ii) systematic contact with corresponding authors to obtain regimen summaries or case-level annotations where permissible; and (iii) mapping regimen data to a controlled vocabulary (agent names, doublet vs. single-agent flags, cycle counts, start timing bins, and dose-intensity categories) to support multi-level exposure modeling. To preserve reproducibility, we will version the resource as a stable Level-1 (Core) dataset with binary ACT and a Level-2 (Regimen-labeled) subset for studies with high-confidence regimen detail, enabling sensitivity analyses that explicitly model agent, schedule, and dose. We will flag coverage and completeness in a public data dictionary so users can select the appropriate tier for their question.

In the integrated cohort, race/ethnicity is recorded as “Unknown” for 964/1340 patients (71.9%), leaving only 376 cases with known values; among these, 94.1% are Caucasian, indicating limited representation of minoritized groups. This profile constrains the database’s immediate utility for health-disparity research and for training or auditing fair AI models, and it limits the generalizability of findings to populations under-captured here. We encoded race/ethnicity as an explicit “Unknown” category to preserve sample size and comparability across studies, consistent with our harmonization rules; however, this practice should be understood as retaining data, not resolving demographic under-representation. Accordingly, we recommend (i) reporting model discrimination, calibration, and treatment-interaction metrics separately for known vs. unknown race/ethnicity; (ii) restricting race-stratified evaluations to the known subset; and (iii) conducting internal–external validation to assess whether performance shifts with cohort-level differences in demographic capture.

Our future study includes an enrichment plan that prioritizes onboarding additional public cohorts with complete race/ethnicity fields, structured author outreach to retrieve missing annotations when permissible, and versioning a “fairness-ready” subset with high demographic completeness and standardized labels to support transportability and fairness audits. We do not impute race/ethnicity for fairness evaluation; instead, we emphasize improved data capture, transparent reporting, and external validation in datasets with reliable demographic fields.

5. Conclusions

We created the first open-access, high-quality, LLM-curated, batch-corrected genomic meta-database linking gene expression to ACT outcomes for early-stage NSCLC that overcomes the historical limitations of a small sample size, platform heterogeneity, and inconsistent clinical annotation. Rigorous preprocessing and batch correction reduce instrumental variability to levels compatible with biomarker discovery, while minimal sample loss attests to data integrity. The resource is designed for fair, head-to-head benchmarking of ACT-benefit models and for generating candidates worthy of prospective validation. This work aims to accelerate the data-driven selection of adjuvant chemotherapy in early-stage NSCLC and advance more equitable precision oncology.

Author Contributions

H.M., M.Y.C. and O.S. had full access to all study data and accept responsibility for the integrity of the data and the accuracy of the analyses. Conceptualization, H.M.; Methodology, H.M., M.Y.C. and O.S.; Software, O.S. (LLM-assisted study-screening app); Validation, M.Y.C., O.S., K.L., G.K., K.K. (Kaden Kwak), K.K. (Koeun Kwak) and A.C.T.; Formal statistical analysis, M.Y.C., K.L., K.K. (Kaden Kwak), K.K. (Koeun Kwak) and A.C.T.; Investigation, O.S., K.L., G.K., K.K. (Kaden Kwak), K.K. (Koeun Kwak) and A.C.T.; Resources, H.M. and O.S.; Data curation (reference management), H.M., K.L., G.K., K.K. (Kaden Kwak), K.K. (Koeun Kwak) and A.C.T.; Writing—original draft preparation, H.M.; Section attributions: Introduction—H.M., K.L., G.K., K.K. (Kaden Kwak), K.K. (Koeun Kwak) and A.C.T.; Methodology—H.M., M.Y.C. and O.S.; Results—H.M., M.Y.C., O.S., K.L., G.K. and K.K. (Kaden Kwak); Discussion and Conclusions—H.M. Writing—review and editing, all authors; Supervision, H.M.; Project administration, H.M.; Funding acquisition, H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are publicly available on GitHub and Zenodo (https://github.com/osun24/NSCLC-Adjuvant-Chemo-Database; https://doi.org/10.5281/zenodo.17215636).

Acknowledgments

Hojin Moon’s research was supported in part by the Research, Scholarship, and Creative Activity (RSCA) Program and the Undergraduate Research Opportunity Program (UROP) at CSULB. Co-authors K.L., G.K., K.K. (Kaden Kwak), K.K. (Koeun Kwak), and A.C.T. substantially contributed to this study as high school research interns under the supervision of H.M., representing Irvine High School (K.L.), Portola High School (G.K., K.K. (Koeun Kwak)), Northwood High School (K.K. (Kaden Kwak)), and Gretchen Whitney High School (A.C.T.). The authors thank Nicholas A. Zarus for facilitating the release of our datasets on Zenodo. N.A.Z. is currently participating in the KURE research program at CSULB. Portions of the manuscript were refined for grammar with the assistance of a large-language model; all authors reviewed and approved the content and accept responsibility for the accuracy and integrity of the work.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. PRISMA flow diagram of the systematic study selection process with GPT-4.1-mini LLM-aided record screening. * GSE31908 was not associated with a report, but its data contained patient-specific overall survival and adjuvant chemotherapy administration status.

Figure 2. Methodology flowchart summarizing research, curation, preprocessing, correction, and quality control steps.

Figure 3. Visualization of pre-harmonization structure by PCA (left) and UMAP (right).

Figure 4. PCA of ComBat-corrected expression profiles. Quality control outliers are overlaid: “×” marks arrays flagged by the median M-score deviation filter (abnormal global signal), and “+” marks high-leverage arrays from PCA (disproportionate influence). All flagged arrays were excluded from the analytic cohort.

Figure 5. UMAP after batch correction.

Figure 6. Histogram of M-scores; the blue line indicates the median and the red line indicates the upper threshold.

Figure 7. Histogram of PCA leverage (after ComBat correction); the red dotted line indicates the cutoff for extreme outlier at the 99.5th percentile.

Table 1. Summary of selected 8 GEO studies.

Platform	GEO Series	Patients (Total/ACT)	Probe Sets (Raw)
GPL570	GSE29013; GSE37745; GSE31908; GSE31210 ^a; GSE50081 ^a; GSE157010 ^a	788/65	54,675
GPL96	GSE68465, GSE14814	559/159	22,283

Note. ^a GSE31210, GSE50081, and GSE157010 exclusively contained patients not treated with ACT.

Table 2. Summary of clinical features.

Feature	Level	Cohort (n = 1340)
Age, years	Median	65
	Range	30–89
	Mean	65
Sex, n (%)	Female	602 (44.9)
	Male	738 (55.1)
Race, n (%)	Caucasian	354 (26.4)
	African American	14 (1.0)
	Asian	7 (0.5)
	Native Hawaiian	1 (0.1)
	Unknown ¹	964 (71.9)
Stage, n (%)	IA	407 (30.4)
	IB	462 (34.5)
	II	343 (25.6)
	III	128 (9.5)
Histology, n (%)	Adenocarcinoma	923 (68.9)
	Squamous Cell Carcinoma	384 (28.7)
	Large Cell Carcinoma	31 (2.3)
	Adenosquamous Carcinoma	2 (0.1)
Smoking, n (%)	Yes	588 (43.9)
	No	182 (13.6)
	Unknown	570 (42.5)
ACT, n (%)	Yes	223 (16.6)
	No	1117 (83.4)

Note. ¹ Categorical clinical variables with absent entries were harmonized using an explicit “Unknown” category to preserve comparability across cohorts and avoid listwise deletion. Percentages are column percentages; values may not sum to 100% due to rounding.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
