SCAPeSCLC: An Integrated Spatial Transcriptomic and Bayesian Pathway Enrichment Dataset for Survival Modeling in Extensive-Stage Small Cell Lung Cancer

Shirvaliloo, Milad

doi:10.3390/data11070152

Open Access Data Descriptor

by Milad Shirvaliloo ¹,²

¹ Finetech in Medicine Research Center, Iran University of Medical Sciences, Tehran 1449614525, Iran

² Faculty of Medicine, Tabriz University of Medical Sciences, Tabriz 5166614766, Iran

Data 2026, 11(7), 152; https://doi.org/10.3390/data11070152 (registering DOI)

Submission received: 30 April 2026 / Revised: 17 June 2026 / Accepted: 18 June 2026 / Published: 23 June 2026

(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 3rd Edition)

Abstract

Small cell lung cancer (SCLC) is an aggressive neuroendocrine malignancy with limited publicly available spatial transcriptomic resources, particularly for extensive-stage disease (ES-SCLC), which remains absent from major initiatives such as The Cancer Genome Atlas (TCGA). To improve accessibility, interoperability, and downstream analytical utility of existing spatial transcriptomic data, SCAPeSCLC was developed as a harmonized dataset derived from two publicly available Gene Expression Omnibus (GEO) series, GSE261345 and GSE261348, generated using the NanoString GeoMx Digital Spatial Profiler platform. The resource integrates normalized expression measurements from 296 tumor regions of interest (ROI) across 58 ES-SCLC patients treated with first-line chemoimmunotherapy. Normalized expression matrices were reformatted into survival-ready column-based datasets at both ROI and patient levels following log₂-transformation and standardization. Clinical metadata were curated and harmonized, and progression-free survival (PFS), disease-specific survival (DSS), overall survival (OS), time-on-treatment (ToT), follow-up intervals, and censoring indicators were reconstructed from the original clinical records. Biological pathway (BP) activity scores were generated using Cancer Transcriptome Atlas (CTA) annotations encompassing 106 BPs. To account for variable ROI sampling across patients, Bayesian hierarchical modeling was applied to estimate patient-level pathway activity, yielding posterior estimates and corresponding credible intervals. The resulting resource includes harmonized expression matrices, pathway enrichment profiles, Bayesian posterior estimates, survival-ready clinical annotations, and standardized Cox proportional hazards modeling outputs, along with a dedicated GitHub repository. 
SCAPeSCLC is intended to facilitate confirmatory analyses, integrative statistical modeling, methodological benchmarking, and reproducible exploration of spatial transcriptomic determinants of survival in ES-SCLC.

Dataset: https://doi.org/10.5281/zenodo.19897645

Dataset License: Creative Commons Attribution 4.0 International

Keywords: extensive stage small cell lung cancer; ES-SCLC; CANTABRICO; IMfirst; Cancer Transcriptome Atlas; GeoMx DSP; digital spatial profiling; spatial transcriptomics; Bayesian analysis

1. Summary

Small cell lung cancer (SCLC) is an invasive neoplasm of neuroendocrine origin that accounts for roughly 15% of all malignant lung tumors and, owing to its rapid progression, has a global five-year survival rate of 7% [1]. Nearly 70% of SCLC patients are diagnosed with the advanced form of the disease, extensive-stage SCLC (ES-SCLC), which is characterized by poor response to first-line chemoimmunotherapy and unfavorable survival outcomes [2]. This has prompted investigators to evaluate the efficacy of emerging treatment options combined with chemotherapeutics in ES-SCLC patients [3,4]. However, effective development of therapeutic regimens requires timely identification of tumor-specific targets. While The Cancer Genome Atlas (TCGA) facilitates biomarker discovery and the exploration of survival outcomes across 33 tumor types, neither SCLC nor ES-SCLC is covered in this database [5].

The NanoString GeoMx Digital Spatial Profiler (DSP) is a spatial biology platform used for profiling ribonucleic acid (RNA) or protein expression in formalin-fixed paraffin-embedded (FFPE) tissue specimens collected from patients with cancer [6]. It is supported by an in-house library of tumor-related genes known as the Cancer Transcriptome Atlas (CTA), which currently covers over 1800 genes and pertinent biological pathways [7]. Despite its capability for addressing spatial heterogeneity in tumor samples, GeoMx DSP has rarely been used for gene expression measurements in SCLC. According to the literature and data available on public repositories, only two peer-reviewed studies, both published in 2024, have incorporated DSP to study SCLC [8,9]. Peressini et al. report transcriptome mapping of 58 ES-SCLC patients from the CANTABRICO and IMfirst cohorts receiving first-line chemoimmunotherapy [8], whereas Zhang et al. focused on intra-tumoral molecular heterogeneity in SCLC and were limited to 25 patients [9]. Important advantages of the former study are the larger number of patients, multiple samples per patient in the form of regions of interest (ROIs), between-cohort baseline similarities, and deposition of the complete spatial transcriptome in the Gene Expression Omnibus (GEO).

The present Data Descriptor describes the Spatial Cancer Transcriptome Atlas of Pathway Enrichment in extensive-stage SCLC (SCAPeSCLC), a harmonized derivative resource generated from the CANTABRICO and IMfirst cohorts. Beyond integrating ROI-level transcriptomic and clinical data, SCAPeSCLC provides standardized ROI- and patient-level expression matrices, harmonized survival endpoints and censoring variables, CTA biological pathway enrichment profiles, Bayesian hierarchical pathway activity estimates with credible intervals, and survival modeling outputs. These components were generated during curation, are not available in the original GEO submissions, and are accompanied by fully reproducible analytical workflows in R version 4.5.2.

2. Data Description

2.1. Dataset Components and Characteristics

The present dataset was curated by merging data from two parallel clinical trials of programmed death ligand 1 (PD-L1) inhibitors, namely atezolizumab (EudraCT Number: 2019-002784-10) and durvalumab (EudraCT Number: 2020-002328-35), involving 32 and 26 ES-SCLC patients, respectively, of Hispanic descent treated at the 12 de Octubre Research Institute in Madrid, Spain. The corresponding data series, GSE261348 (atezolizumab) and GSE261345 (durvalumab), were made available on the GEO in March 2024 by the original investigators and included patient metadata and event dates (Table 1).

These data series contained multiple tumor samples per patient in the form of ROIs, which were processed and sequenced for over 1700 CTA genes using the NanoString GeoMx DSP platform [11]. The rationale for merging these datasets included (1) recruitment from the same clinical research institute during overlapping study periods, (2) use of identical GeoMx DSP and CTA workflows [12], (3) consistent normalization procedures, (4) collection of tumor specimens prior to treatment initiation, and (5) administration of first-line chemoimmunotherapy regimens based on PD-L1 inhibition.

2.2. Patient Demographics and Baseline Clinical Characteristics

Datasheet D1 includes the demographic information and clinical characteristics of 58 ES-SCLC patients from atezolizumab (n = 32) and durvalumab (n = 26) treatment arms. Treatment-related variables incorporated into the dataset include PD-L1 inhibitor assignment (atezolizumab or durvalumab), platinum agent (carboplatin or cisplatin), and chemotherapy backbone (etoposide). Detailed information regarding subsequent lines of therapy, treatment modifications, and patient-specific dosing was not available in the original source datasets. While the baseline chemoimmunotherapy regimen was broadly comparable across the combined cohort, demographics and baseline clinical characteristics were statistically compared between treatment arms to assess cohort compatibility and identify potential confounding variables (see Table 2).

The mean age across the 58 patients in the combined dataset was 64.5 ± 8.02 years. The prevalence of male sex was 67.2%, corresponding to a female-to-male ratio of 1:2.05, and the prevalence of current and former smoking was 48.3% and 51.7%, respectively. No statistically significant differences were observed between the two treatment arms except for bone metastasis, with a prevalence of 46.15% in the durvalumab arm and 12.5% in the atezolizumab arm (see Table 2). Although this imbalance was independent of therapy assignment and the dataset is not intended for between-treatment comparisons, users should account for it, for example by adjusting for bone metastasis, when conducting outcome analyses. The overall prevalence of metastasis, however, was comparable between the two arms.

Principal Component Analysis of Cohort Structure

Because the dataset was generated by merging two independent GEO series, principal component analysis (PCA) was performed using patient-level log₂-transformed expression values to assess global transcriptomic similarity between cohorts.

Visual inspection of the first two principal components demonstrated substantial overlap between the atezolizumab and durvalumab cohorts without complete separation (Figure 1). PC1 and PC2 explained 42.8% and 8.4% of the total variance, respectively (see Table S1).

To further evaluate potential cohort-related structure, PC1 and PC2 scores were examined in multivariable linear regression models including treatment arm, age, sex, smoking status, ECOG performance status, platinum agent, and metastatic sites. Treatment arm and sex were associated with PC1, whereas bone metastasis was not (Table S2).

Because all tumor specimens were obtained before treatment initiation and the two cohorts originated from the same institution using identical high-dimensional GeoMx DSP workflows, the observed variation is likely to reflect a combination of biological and cohort-specific factors rather than a clear technical batch effect. Consequently, no batch-correction procedure was applied, and the original normalized expression values were retained to preserve the biological signal.

2.3. Expression of Cancer Transcriptome Atlas Genes

2.3.1. ROI Allocation and Global Gene Expression Distributions

The dataset comprises 296 ROIs. Sample and patient metadata are provided in two separate sheets, one containing log₂-transformed DSP values and the other containing scaled log₂-transformed values (z-scores) for 1722 CTA genes, formatted in columns rather than rows for ease of use. These data are available in Datasheets D2–3 deposited in Zenodo.

Datasheets D4–5 contain patient metadata and survival data, along with patient-level log₂-transformed DSP values and scaled log₂-transformed values for the same 1722 CTA genes. These values were averaged from ROI-level measurements and formatted in columns (see Figure 2).

Figure 2A–C show the distribution of gene expression densities across the entire cohort, individual patients, and treatment arms. Across all 99,876 measurements, the global median gene expression (log₂) is 7.75 (IQR 7.34–8.26). When stratified by treatment arm, the mean expression values are 7.84 (IQR 7.45–8.25) for atezolizumab and 7.64 (IQR 7.18–8.25) for durvalumab. Figure 2D illustrates the distribution of ROIs at the patient level, with patient identifiers labeled accordingly.

2.3.2. SCLC Subtype Markers and Key Cancer-Related Genes

Conventionally, SCLC is categorized into four primary subtypes, SCLC-A, SCLC-N, SCLC-P and SCLC-Y, which are characterized by dominant expression of ASCL1, NEUROD1, POU2F3 and YAP1, respectively. More recent classification systems replace SCLC-Y with an inflamed variant, SCLC-I, which is characterized by low expression of the three remaining markers, i.e., ASCL1, NEUROD1 and POU2F3 [13]. In the current work, the former classification system was adopted for data exploration purposes (see Figure 3).

Figure 3A shows density histograms of the four subtype marker genes in the combined cohort, supplemented by expression box plots and median percentile metrics in Figure 3B. Although SCLC subtypes in the CANTABRICO and IMfirst cohorts were not explicitly reported by the original investigators, the combined cohort showed substantially higher expression of ASCL1 than of NEUROD1, POU2F3 or YAP1. This is consistent with recent studies reporting higher frequencies of the SCLC-A subtype and lower proportions of the SCLC-N and SCLC-P variants in separate cohorts [13,14,15,16]. Although this pattern supports neuroendocrine identity in the combined cohort, true subtype classification requires per-sample analysis rather than global medians. Ranked median percentiles are provided in Supplementary File S1.

According to the Memorial Sloan Kettering (MSK) mutation profiling of 341 key cancer-related genes in over 280 tumor samples from different cancers, the most frequently altered genes in SCLC include TP53 with the highest reported frequency, followed by RB1, KMT2D, PTEN, NOTCH1, CREBBP, NF1, APC, EGFR, KRAS, NOTCH3, ARID1A, ATRX and EP300 [17]. These SCLC-related alterations were thoroughly reviewed in the form of a Disease Primer [18]. Figure 4 visualizes the transcription levels of 14 SCLC-related genes in the present dataset. This plot is chiefly intended for exploratory purposes, and transcriptional levels of these genes may not accurately reflect their mutational status.

2.3.3. Patient Clustering Based on Subtype Marker Expression

To better characterize the subtype composition of the combined cohort, patients were clustered using unsupervised model-based clustering based on the expression of ASCL1, NEUROD1, POU2F3, and YAP1. These markers were selected according to established transcriptional subtype frameworks for SCLC [18] and were evaluated in the original source study [8]. Cluster assignments were incorporated into cohort-level visualizations and are provided in Supplementary File S2. Detailed clustering visualizations are provided in Figures S1–S4. As a sensitivity analysis, alternative clustering strategies and marker combinations were explored, including hierarchical clustering, neighborhood-based clustering, and model-based clustering incorporating MYC expression due to its emerging role as a driving factor involved in the temporal evolution of SCLC subtypes [19]. Comparative performance metrics are summarized in Table S3. These exploratory analyses did not materially improve cluster stability or biological interpretability relative to the selected model-based framework, which was therefore retained for cohort characterization.

Unsupervised model-based clustering identified four clusters (R² = 0.648, BIC = 143.5, silhouette score = 0.300, see Table S4). Although the silhouette score indicates moderate cluster separation, the resulting clusters demonstrated biologically coherent subtype patterns and revealed two ASCL1⁻/NEUROD1⁻/POU2F3⁺ patients, consistent with the relatively low prevalence of the SCLC-P subtype reported in previous cohorts [13,14,15,16].

Figure 5A visualizes the per-cluster expression of subtype-defining markers in the combined cohort. Subsequent PCA supported the distribution of the identified clusters, with component 1 and component 2 explaining 52.8% and 23.2% of the variance, respectively, accounting for a cumulative 76% of total variance (see Figure 5B). The two POU2F3⁺ patients, i.e., patients 34 and 38 (IMF009 and IMF013 from the IMfirst cohort), are clearly separated from the main cohort distribution. Global subtype marker enrichment across clusters is shown in Figure 5C.

2.4. Survival Metrics and Proportional Hazards

2.4.1. Survival Endpoints, Outcome Events and Time-to-Event Intervals

Survival endpoints, including progression-free survival (PFS), disease-specific survival (DSS), and overall survival (OS), are provided as time-to-event intervals. Corresponding binary outcome events (disease progression, disease-specific death, and all-cause death) are included as standalone variables in Datasheet D6 and are also integrated into other patient-level datasheets. Censoring status is also provided for all survival endpoints.

The median follow-up duration was 11.8 months (IQR 7.44–20.7) in the combined cohort, 12.9 months (IQR 7.54–21.9) in the atezolizumab arm, and 9.82 months (IQR 7.01–18.2) in the durvalumab arm. Table 3 summarizes the frequency of outcome events and time-to-event durations for PFS, DSS, and OS.

Given the column-based structure of the dataset, users can readily incorporate survival data into semi-parametric or non-parametric survival analyses using their statistical software of choice. Figure 6 illustrates time-dependent risk sets, baseline cumulative hazard functions, and survival probability curves for PFS, DSS, and OS in the combined cohort, based on non-parametric Kaplan–Meier survival analysis.
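As an illustration of how column-based time and event variables feed a non-parametric analysis, the following is a minimal Kaplan–Meier sketch. It is written in Python rather than the R workflow used for this dataset, and the toy follow-up times and event indicators are invented for the example; they are not taken from SCAPeSCLC.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival) at each distinct event time.

    times  : follow-up durations (any consistent unit, e.g. months)
    events : 1 = event observed, 0 = censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Events and censored observations both leave the risk set.
        n_at_risk -= n_leaving
    return curve


# Toy data (hypothetical): five patients, two of them censored.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(toy_times, toy_events)
```

The estimate drops only at event times; censored patients leave the risk set without lowering the curve.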

2.4.2. CTA Gene Expression Survival Analysis

Inclusion of scaled CTA gene expression data in a column-based structure, combined with survival time-to-event data, enables survival modeling across the full cohort. Accordingly, unadjusted and covariate-adjusted Cox proportional hazards (CPH) analyses, with Schoenfeld residuals tests of the proportional hazards (PH) assumption, were performed for all 1722 genes included in the CTA panel and incorporated into the main dataset. Figure 7 visualizes cumulative forest plots of hazard ratios (HRs) derived from unadjusted CPH models using OS as the endpoint for a subset of genes with CI < 0.5. Note that a Schoenfeld p-value > 0.05 indicates no evidence that the effect of gene expression changes over time, i.e., the PH assumption is not rejected for that covariate.

Although the corresponding false discovery rates (FDR) exceed 0.8, several genes highlighted in red show HR estimates whose confidence intervals lie entirely below 1.0, with nominal unadjusted p-values < 0.05.

Results from both unadjusted and covariate-adjusted CPH models for PFS, DSS and OS across the entire CTA panel are included in Datasheets D7–8. Overall, 35 (2.03%), 48 (2.79%), and 36 (2.09%) genes exhibited non-proportional hazards for confounder-adjusted OS, DSS and PFS, respectively, based on Schoenfeld residuals tests (p-value < 0.05). These genes are indicated in the corresponding datasheets and presented separately in Supplementary File S3 (XLSX).

Adjustment variables were treatment arm (atezolizumab vs. durvalumab) and the presence of bone metastasis, incorporated to account for clinically relevant baseline heterogeneity (see Table 2). These analyses were included to provide survival-ready annotation of the dataset and to demonstrate potential downstream applications. Importantly, the dataset is optimized for analyses conducted across the full combined cohort. Stratified or treatment-specific analyses substantially reduce the number of events available for modeling and may limit the statistical precision and stability of estimates.

2.5. Cancer Transcriptome Atlas Biological Pathway Enrichment and Proportional Hazards

2.5.1. CTA Biological Pathway Enrichment at ROI and Patient Level

Currently, CTA covers over 1800 coding RNA probes designed for cancer-specific transcriptional profiling of tumors. These genes are assigned to over 100 cancer biological pathways (BPs) classified into 7 categories, with gene counts ranging from 3 to 261 per pathway [20,21]. The complete CTA biological pathway annotation is provided in Datasheet D9. This gene library is less frequently adopted for pathway enrichment analysis, which is more commonly conducted using the well-established Molecular Signatures Database (MSigDB) [22]. Using the scaled gene expression data at the ROI and patient level available in the curated dataset, CTA BP enrichment scores were computed and are available as column-structured matrices in Datasheets D10–11. Quality control (QC) information for CTA BPs can be found in Datasheet D12. The ROI-level CTA BP enrichment scores were subsequently used for Bayesian posterior estimation of pathway activity at the patient level.

2.5.2. Bayesian CTA Biological Pathway Enrichment Posteriors at Patient Level

Bayesian CTA BP enrichment posteriors with corresponding lower and upper credible interval limits for 106 CTA BPs are provided in Datasheet D13. A patient-labeled heatmap atlas of this Bayesian dataset is presented in Figure 8.

As shown in the heatmap, the combined cohort is stratified by patient clusters and treatment arms, with BPs themselves clustered into 7 BP categories, including adaptive immunity, innate immunity, immune response, cell function, metabolism, signaling pathways, and physiology and disease. This visualization is included to provide users with an overview of the biological pathway composition of the combined cohort.

2.5.3. Bayesian CTA Pathway Enrichment Survival Analysis

The estimated Bayesian posteriors and credible intervals for CTA BPs provide a probabilistic reference atlas of tumor biology in the combined ES-SCLC cohort. These values were subsequently incorporated into unadjusted univariable and covariate-adjusted multivariable CPH models to characterize potential associations with survival in the combined cohort. The results for all three survival endpoints (PFS, DSS and OS), including unadjusted p-values and Benjamini–Hochberg (BH) FDR values, are provided in Datasheets D14–15. It should be emphasized that these analyses were not conducted for discovery purposes or to infer novel biological relationships. Figure 9 presents cumulative forest plots of PFS and DSS hazard ratios for a subset of CTA BPs with relatively narrow confidence intervals. FDR values are reported to account for multiple hypothesis testing. Based on Schoenfeld residuals tests, all CTA BPs across the three survival endpoints were consistent with the PH assumption (p-value > 0.05), with the exception of “fatty acid synthesis” in confounder-adjusted PFS, which showed evidence of non-proportional hazards (p-value < 0.05).

3. Methods

This section describes the procedures applied to curate and structure the SCAPeSCLC dataset, including data integration, preprocessing, pathway scoring, and Bayesian modeling workflows. The dataset was derived exclusively from publicly available gene expression and clinical metadata deposited in the GEO. The original clinical investigations were conducted following ethical approval and were registered with the European Union Drug Regulating Authorities Clinical Trials Database (EudraCT) under identification numbers 2020-002328-35 (durvalumab) “https://www.clinicaltrialsregister.eu/ctr-search/trial/2020-002328-35/results (accessed on 2 February 2026)” and 2019-002784-10 (atezolizumab) “https://www.clinicaltrialsregister.eu/ctr-search/trial/2019-002784-10/results (accessed on 2 February 2026)”. All procedures in the original trials were performed by Peressini et al. in accordance with the principles of the Declaration of Helsinki (1975), as revised in 2008 [8]. The SCAPeSCLC dataset contains only de-identified, publicly available data, and no personally identifiable patient information is included. Accordingly, no additional ethical approval was required for the present data curation and integration procedures.

3.1. Data Structure and Cohort Integration

3.1.1. Source Datasets

This dataset was curated by merging two publicly available spatial transcriptomic datasets deposited in the GEO, including GSE261345 “https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE261345 (accessed on 2 February 2026)” and GSE261348 “https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE261348 (accessed on 2 February 2026)”.

GSE261345 (CANTABRICO cohort) included 121 ROIs corresponding to 26 patients with ES-SCLC treated with etoposide plus platinum chemotherapy combined with durvalumab. GSE261348 (IMfirst cohort) included 175 ROIs corresponding to 32 ES-SCLC patients treated with etoposide plus platinum chemotherapy combined with atezolizumab.

3.1.2. Data Retrieval and File Processing

For each dataset, the following files were retrieved:

Normalized DSP expression matrices (XLSX format)
Series matrix metadata files (TXT format)
CTA probe and pathway annotation files (PKC format)

Expression matrices were imported into RStudio (Version 2026.01.0+392) using the R programming language version 4.5.2. All matrices were transposed to generate column-based data structures, allowing genes and biological variables to be represented as columns and observations as rows. Patient identifiers and ROI identifiers were matched between metadata tables and expression matrices to construct harmonized expression datasets at both the ROI and patient levels.
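The restructuring described above (transposing a genes-by-ROIs matrix into a column-based layout, then collapsing ROIs to one row per patient by averaging) can be sketched as follows. This is an illustrative Python sketch, not the R code used for curation; the identifiers `R1`, `P1`, etc. and the toy values are hypothetical.

```python
from collections import defaultdict


def transpose(matrix):
    """Turn a genes-by-ROIs matrix (genes as rows) into ROIs-by-genes (genes as columns)."""
    return [list(column) for column in zip(*matrix)]


def average_by_patient(roi_rows, roi_to_patient):
    """Average ROI-level rows into one row per patient.

    roi_rows       : {roi_id: [value per gene]}
    roi_to_patient : {roi_id: patient_id}
    """
    grouped = defaultdict(list)
    for roi_id, values in roi_rows.items():
        grouped[roi_to_patient[roi_id]].append(values)
    return {
        patient: [sum(column) / len(column) for column in zip(*rows)]
        for patient, rows in grouped.items()
    }


# Hypothetical toy input: 2 genes (rows) measured in 3 ROIs (columns).
genes_by_roi = [[1.0, 3.0, 5.0],
                [2.0, 4.0, 6.0]]
roi_ids = ["R1", "R2", "R3"]
roi_to_patient = {"R1": "P1", "R2": "P1", "R3": "P2"}

roi_by_gene = transpose(genes_by_roi)          # one row per ROI
roi_rows = dict(zip(roi_ids, roi_by_gene))     # match rows to ROI identifiers
patient_rows = average_by_patient(roi_rows, roi_to_patient)
```

Keeping the ROI-to-patient mapping explicit makes it straightforward to confirm that every ROI in the expression matrix has a matching metadata record before averaging.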

3.1.3. Assessment of Cohort Compatibility

To assess global transcriptomic compatibility between the two source cohorts, PCA was performed using patient-level log₂-transformed expression values across all genes shared between datasets.

Principal component analysis was conducted using the prcomp() function in R with centering and scaling enabled. Cohort distributions were visualized with ggplot2 using the first two principal components. To explore potential sources of variation, PC1 and PC2 scores were evaluated using multivariable linear regression incorporating treatment arm, age, sex, smoking status, ECOG performance status, platinum agent, and metastatic sites.
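To make the "variance explained" figures concrete, the sketch below standardizes each column, forms the correlation matrix, and estimates the share of total variance carried by the first principal component via power iteration. It is a dependency-free Python illustration of centered and scaled PCA, not a re-implementation of prcomp(); the fixed starting vector and iteration count are assumptions, and the toy matrices are invented.

```python
import math
import statistics


def standardize_columns(rows):
    """Center each column and divide by its sample standard deviation (n - 1)."""
    scaled_columns = []
    for column in zip(*rows):
        mean = statistics.mean(column)
        sd = statistics.stdev(column)
        scaled_columns.append([(value - mean) / sd for value in column])
    return [list(row) for row in zip(*scaled_columns)]


def pc1_explained_variance(rows, n_iter=500):
    """Fraction of total variance explained by PC1 after centering and scaling."""
    z = standardize_columns(rows)
    n, p = len(z), len(z[0])
    # Correlation matrix of the standardized columns.
    corr = [[sum(z[k][i] * z[k][j] for k in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]
    # Power iteration from a fixed, non-uniform start (an assumption of this sketch).
    v = [float(i + 1) for i in range(p)]
    for _ in range(n_iter):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the leading eigenvalue; total variance equals p.
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue / p


# Hypothetical toy matrices (rows = patients, columns = genes).
perfectly_correlated = [[1, 2], [2, 4], [3, 6], [4, 8]]
uncorrelated = [[1, 1], [2, -1], [3, -1], [4, 1]]
share_correlated = pc1_explained_variance(perfectly_correlated)
share_uncorrelated = pc1_explained_variance(uncorrelated)
```

With two perfectly correlated genes PC1 carries all of the variance, whereas with two uncorrelated genes it carries half, which is the lower bound for two standardized variables.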

3.2. Gene Expression Processing

3.2.1. ROI and Patient-Level Expression Matrices

Primary exhaustive gene expression matrices were generated at both ROI and patient levels using normalized DSP expression values. These matrices included expression data for 1722 genes defined within the CTA panel.

Expression values were log₂-transformed across all gene columns to stabilize variance and improve comparability across samples. The resulting ROI-level and patient-level log₂-transformed matrices were stored as Datasheets D2 and D4, respectively.

Subsequently, log₂-transformed expression values were standardized using z-score normalization across each gene column. Standardized ROI-level and patient-level matrices were stored as Datasheets D3 and D5, respectively.
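The two transformations can be sketched as below. This is a minimal Python illustration rather than the R code used for curation; the use of the sample standard deviation (n − 1) is an assumption, and the toy values are invented.

```python
import math
import statistics


def log2_then_zscore(matrix):
    """Return (log2 matrix, z-scored matrix) for rows = observations, columns = genes.

    Values must be strictly positive. Z-scores are computed per gene column
    using the sample standard deviation (an assumption of this sketch).
    """
    logged = [[math.log2(value) for value in row] for row in matrix]
    scaled = [row[:] for row in logged]
    for j in range(len(logged[0])):
        column = [row[j] for row in logged]
        mean = statistics.mean(column)
        sd = statistics.stdev(column)
        for i in range(len(logged)):
            scaled[i][j] = (logged[i][j] - mean) / sd
    return logged, scaled


# Hypothetical normalized counts: 3 observations x 2 genes.
toy_counts = [[64.0, 8.0],
              [256.0, 32.0],
              [1024.0, 128.0]]
toy_log2, toy_z = log2_then_zscore(toy_counts)
```

After standardization every gene column has mean 0 and unit standard deviation, so downstream hazard ratios are expressed per one standard deviation of log₂ expression.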

3.2.2. Exploratory SCLC Subtype Clustering

To characterize the molecular composition of the combined ES-SCLC cohort, exploratory unsupervised clustering was performed using the established SCLC subtype markers ASCL1, NEUROD1, POU2F3, and YAP1.

Scaled patient-level expression values were subjected to model-based clustering using Gaussian finite mixture models in R. The optimal number of clusters and covariance structure were determined automatically according to Bayesian Information Criterion (BIC) optimization without pre-specifying the number of clusters. Model fitting was performed iteratively until convergence using default expectation-maximization procedures.

To assess robustness, alternative unsupervised approaches, including hierarchical clustering, nearest-neighbor clustering, density-based clustering, and additional marker combinations incorporating MYC, were explored. Performance metrics, including silhouette score, BIC, and explained variance (R²), were compared across models (Table S3). The final model was selected based on a combination of statistical performance, cluster stability, and biological interpretability.

3.3. Clinical Metadata and Survival Endpoint Construction

3.3.1. Metadata Integration

Patient-level clinical metadata were extracted from row-based series matrix files and integrated into harmonized metadata tables. Available metadata comprised baseline clinical characteristics (age, sex, smoking status, ECOG performance status and metastasis) and treatment-related dates.

Matched patient identifiers, both numeric and nominal, were used to ensure consistency between expression matrices and clinical datasets.

3.3.2. Definition of Survival Endpoints

Binary event indicators provided in the source datasets were used to define survival endpoints, including:

Disease progression
Death due to disease
Death from any cause

Censoring status was determined using last follow-up dates when events were absent.

Time-to-event intervals were calculated using treatment initiation (first dose date) as the reference time point:

Time on treatment (ToT): last dose date − first dose date
PFS: progression date − first dose date
DSS: disease-specific death date − first dose date
OS: death date − first dose date

Time-to-progression (TTP) was also calculated but showed complete equivalence with PFS across all patients and was therefore retained only as a derived reference variable. Patient metadata and survival metrics were stored in Datasheets D1 and D6, respectively.
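The interval and censoring logic above can be sketched as follows. This Python sketch is illustrative only: the function name, the example dates, and the days-to-months conversion factor (365.25 / 12) are assumptions, since the conversion used for the deposited datasheets is not stated here.

```python
from datetime import date

# Assumption: average month length; the conversion used in the dataset is not stated.
DAYS_PER_MONTH = 365.25 / 12


def time_to_event(first_dose, event_date, last_follow_up):
    """Return (duration in months, event indicator) for one endpoint.

    If the event date is missing, the patient is censored at last follow-up.
    """
    if event_date is not None:
        end, event = event_date, 1
    else:
        end, event = last_follow_up, 0
    return (end - first_dose).days / DAYS_PER_MONTH, event


# Hypothetical patients: one with documented progression, one censored.
pfs_months, pfs_event = time_to_event(
    date(2021, 1, 1), date(2021, 7, 1), date(2022, 1, 1))
cens_months, cens_event = time_to_event(
    date(2021, 1, 1), None, date(2022, 1, 1))
```

The same function applies to PFS, DSS, and OS by passing the corresponding event date, which keeps the reference time point (first dose) identical across endpoints.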

3.4. CTA Biological Pathway Enrichment

3.4.1. ROI and Patient-Level Pathway Scoring

CTA pathway definitions were extracted from PKC annotation files associated with the NanoString GeoMx Cancer Transcriptome Atlas. The complete annotation matrix can be found in Datasheet D9.

Using standardized gene expression matrices, pathway enrichment scores were computed as the arithmetic mean of z-score–standardized expression values across all genes assigned to a given CTA biological pathway. Pathway scores were calculated separately for ROI-level and patient-level datasets. To improve score stability and reduce the influence of sparsely represented pathways, enrichment scores were generated only when at least five pathway-associated genes were present in the expression matrix.
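A minimal sketch of this scoring rule, including the five-gene threshold, is shown below in Python (the curation itself was done in R). Gene and pathway names are placeholders, not CTA annotations.

```python
def pathway_scores(zscores, pathways, min_genes=5):
    """Mean z-score per pathway for one ROI or patient.

    zscores  : {gene: z-scored expression}
    pathways : {pathway name: [member genes]}
    Pathways with fewer than `min_genes` members present are skipped.
    """
    scores = {}
    for name, genes in pathways.items():
        present = [gene for gene in genes if gene in zscores]
        if len(present) >= min_genes:
            scores[name] = sum(zscores[gene] for gene in present) / len(present)
    return scores


# Placeholder genes and pathways (hypothetical, for illustration only).
toy_z = {"G1": 1.0, "G2": -1.0, "G3": 0.5, "G4": 0.5, "G5": 1.5, "G6": 2.0}
toy_pathways = {
    "toy_pathway_A": ["G1", "G2", "G3", "G4", "G5"],  # 5 genes present: scored
    "toy_pathway_B": ["G5", "G6", "G7"],              # 2 genes present: skipped
}
toy_scores = pathway_scores(toy_z, toy_pathways)
```

Because the score is an unweighted mean, large and small pathways are on the same scale, but scores for small pathways are noisier, which is the motivation for the minimum gene count.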

ROI-level and patient-level pathway enrichment matrices were generated and stored as Datasheets D10–11. QC statistics are included in Datasheet D12.

3.4.2. Patient-Level Bayesian Estimation of Pathway Enrichment Posteriors

Bayesian hierarchical modeling was selected because pathway activity was measured repeatedly across multiple ROIs per patient, and ROI counts varied between individuals (see Figure 2D). Compared with simple arithmetic averaging, hierarchical modeling accounts for within-patient clustering, partially pools information across observations, and generates patient-level posterior estimates accompanied by credible intervals that explicitly quantify uncertainty [23].

For each pathway, ROI-level enrichment scores were modeled using a Gaussian hierarchical model with patient-specific random intercepts (see the R code in the associated GitHub repository, v1.1). Let y_ij denote the pathway enrichment score for ROI j from patient i. The model assumed:

y_ij ~ N(α + b_i, σ²)

where α represents the global pathway mean, and b_i denotes the patient-specific random effect. Random effects were assumed to follow:

b_i ~ N(0, τ²)

Models were fitted using four Markov chain Monte Carlo chains, 4000 iterations per chain, 800 warmup iterations, an adaptation delta of 0.95, and a fixed random seed of 123.

Because pathway enrichment scores were standardized as z-scores prior to modelling, weakly informative normal priors centered at zero (mean = 0, SD = 1) were specified for intercept parameters. These priors are consistent with the expected scale of standardized pathway scores while avoiding strong assumptions regarding pathway activity. Exponential priors were specified for random-effect standard deviations to regularize variance estimates while allowing substantial between-patient heterogeneity.

Posterior summaries were extracted from patient-specific random intercept terms. For each patient and pathway, posterior means of b_i together with 95% credible intervals (2.5th and 97.5th posterior quantiles) were retained and stored in Datasheet D13. The increased iteration count was selected to improve effective sample sizes and ensure stable posterior estimation across pathways. Convergence was assessed using the potential scale reduction factor (Rhat) and effective sample sizes. All fitted models achieved Rhat values below 1.01, and no convergence warnings were observed.
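The partial pooling performed by this model can be illustrated in closed form. If α, σ and τ are treated as known, the posterior of b_i is Gaussian with precision n_i/σ² + 1/τ² and mean (n_i/σ²)(ȳ_i − α) divided by that precision. The Python sketch below applies this conditional result; it is a simplification for intuition only, since the fitted models estimate α, σ and τ jointly by MCMC, and the hyperparameter values and toy scores are assumptions.

```python
import math


def conditional_random_intercept(scores, alpha, sigma, tau):
    """Posterior mean and 95% interval of b_i given fixed alpha, sigma, tau.

    scores : ROI-level pathway scores for one patient
    alpha  : global pathway mean
    sigma  : residual (within-patient) standard deviation
    tau    : between-patient standard deviation
    """
    n = len(scores)
    patient_mean = sum(scores) / n
    precision = n / sigma ** 2 + 1.0 / tau ** 2
    mean = (n / sigma ** 2) * (patient_mean - alpha) / precision
    sd = math.sqrt(1.0 / precision)
    return mean, (mean - 1.96 * sd, mean + 1.96 * sd)


# Two hypothetical patients with the same raw mean (1.0) but different ROI counts.
few_mean, few_interval = conditional_random_intercept(
    [1.0, 1.0], alpha=0.0, sigma=1.0, tau=1.0)
many_mean, many_interval = conditional_random_intercept(
    [1.0] * 10, alpha=0.0, sigma=1.0, tau=1.0)
```

The patient with two ROIs is pulled more strongly toward the global mean and receives a wider interval than the patient with ten ROIs, which is the behavior that simple averaging cannot provide.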

3.5. Survival Modeling

Cox Proportional Hazards Modeling

Survival associations were evaluated using CPH models implemented with the survival package. Both univariable and confounder-adjusted multivariable CPH models were applied to:

Patient-level scaled gene expression variables
Patient-level Bayesian pathway enrichment posteriors

Adjusted models included treatment assignment and bone metastasis status as covariates. Hazard ratios, 95% confidence intervals, and model statistics were extracted using the broom package. PH assumptions were evaluated for all univariable and confounder-adjusted CPH models using Schoenfeld residuals tests implemented through the cox.zph() function in R. Both global model-level and variable-specific tests were performed for all survival endpoints (PFS, DSS, and OS). Results were stored across Datasheets D7–8 and D14–15.
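For readers less familiar with how a hazard ratio and its confidence interval arise from a CPH model, the following sketch fits a univariable model by Newton–Raphson maximization of the partial likelihood (Breslow form). It is a from-scratch Python illustration on hypothetical data, not the survival package and not the author's code. It omits covariate adjustment and the Schoenfeld residual test.

```python
import math


def fit_cox_univariable(times, events, x, n_iter=25, tol=1e-10):
    """Return (HR, lower 95% CI, upper 95% CI) for a single covariate."""
    beta, info = 0.0, 0.0
    for _ in range(n_iter):
        score, info = 0.0, 0.0
        for i, (t_i, d_i) in enumerate(zip(times, events)):
            if not d_i:
                continue  # censored records contribute only through risk sets
            risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
            w = [math.exp(beta * x[j]) for j in risk]
            s0 = sum(w)
            s1 = sum(wj * x[j] for wj, j in zip(w, risk))
            s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk))
            score += x[i] - s1 / s0                 # gradient of log partial likelihood
            info += s2 / s0 - (s1 / s0) ** 2        # observed information
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / math.sqrt(info)  # information from the final iteration
    return math.exp(beta), math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)


# Hypothetical data: months to event, event indicator (0 = censored), binary covariate.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 0, 0]
x = [1, 1, 0, 1, 0, 1, 0, 0]

hr, ci_low, ci_high = fit_cox_univariable(times, events, x)
print(round(hr, 2), round(ci_low, 2), round(ci_high, 2))
```

With only eight hypothetical records the interval is very wide and spans 1, which shows why confidence interval width is a useful filter when selecting variables for display.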

3.6. Statistical Analysis

Baseline clinical characteristics, including age, sex, smoking status (current and former smoker), platinum agent, ECOG performance status (0, 1 and 2) and metastasis (CNS, liver and bone) were evaluated to assess compatibility between merged cohorts. Age was compared using Welch’s t-test, while categorical variables were compared using chi-squared or Fisher’s exact tests where appropriate.
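As a worked example of the age comparison, the Welch statistic and its Welch–Satterthwaite degrees of freedom can be computed from the summary values reported in Table 2. This Python sketch assumes that the reported age values are means ± standard deviations, which the table does not state explicitly. The function name is hypothetical.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2  # squared standard errors per group
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Age (yr) from Table 2: atezolizumab 62.9 ± 8.57 (N = 32), durvalumab 66.3 ± 7.00 (N = 26).
t_stat, df = welch_t(62.9, 8.57, 32, 66.3, 7.00, 26)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

The statistic is about −1.66 on roughly 56 degrees of freedom, which is in line with the non-significant p-value reported for age. Converting the statistic to a p-value requires the t-distribution, which is not available in the Python standard library.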

Multiple hypothesis testing across gene-level and pathway-level survival analyses was addressed using the Benjamini–Hochberg FDR correction. FDR adjustment was applied across full gene and pathway sets within each survival endpoint.
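The Benjamini–Hochberg adjustment applied separately within each endpoint can be sketched as follows. This is a minimal Python illustration; the p-values are hypothetical and the function is not taken from the repository.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four variables, keyed by survival endpoint.
raw = {
    "PFS": [0.01, 0.04, 0.03, 0.005],
    "OS": [0.20, 0.50, 0.02, 0.80],
}
# The correction is run independently for each endpoint.
fdr = {endpoint: benjamini_hochberg(p) for endpoint, p in raw.items()}
print(fdr["PFS"])
```

Adjusting within each endpoint means that the number of tests m equals the size of the gene or pathway set for that endpoint, rather than the total across PFS, DSS, and OS.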

All statistical analyses were conducted in R using the following primary packages: dplyr 1.2.1, brms 2.23.0, broom 1.0.12, stringr 1.6.0, survival 3.8-6, tidyr 1.3.2, tidyverse 2.0.0 and data.table 1.18.2.1. Illustrations were created using the ggplot2 4.0.2, forestplot 3.2.0 and ComplexHeatmap 2.26.1 packages.

3.7. SCAPeSCLC GitHub Repository

R scripts used to implement both unadjusted and covariate-adjusted CPH modelling across all CTA genes, Bayesian posterior immune profile variables, and survival endpoints are publicly available through the SCAPeSCLC GitHub repository (https://github.com/Chromocyte/SCAPeSCLC, accessed on 29 April 2026). The repository contains the complete set of analysis scripts and processed datasets required to reproduce all modelling procedures described in this work.

To ensure long-term accessibility and version control, an archived release of the repository (v1.1) has been deposited in Zenodo (https://doi.org/10.5281/zenodo.19896168).

4. User Notes

The principal objective of the SCAPeSCLC dataset was to provide an accessible and analysis-ready spatial transcriptomic resource for ES-SCLC. Several common malignancies benefit from large-scale resources for exploratory gene expression and survival analyses via web-based platforms such as GEPIA [24,25] and KMplot [26]. In contrast, publicly available ES-SCLC transcriptomic datasets remain limited and often require substantial preprocessing before integration with clinical outcomes.

Gene expression datasets distributed through GEO are commonly organized in row-based formats, where genes occupy rows and samples occupy columns [27]. While appropriate for bioinformatics workflows, these structures frequently require transposition and additional preprocessing before integration with clinical covariates or time-to-event analyses [28]. To address this limitation, the SCAPeSCLC dataset was constructed using a column-based architecture in which each gene occupies a dedicated column, while rows correspond to individual ROIs or patient-level records, depending on the analytical context. This harmonized column-based layout may facilitate exploratory, confirmatory, and educational analyses by reducing the amount of data restructuring required prior to statistical modeling.
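The restructuring step that the column-based layout removes can be sketched as follows. This Python illustration transposes a small genes-in-rows table into one record per patient and attaches survival columns. The patient identifiers, expression values, and clinical column names (`os_months`, `os_event`) are hypothetical and do not reflect the actual datasheet headers.

```python
def to_column_based(row_based):
    """Transpose {gene: {sample: value}} into one record per sample."""
    samples = sorted({s for values in row_based.values() for s in values})
    return [
        {"sample": s, **{gene: values[s] for gene, values in row_based.items()}}
        for s in samples
    ]


# Hypothetical GEO-style layout: genes in rows, samples in columns.
row_based = {
    "ASCL1": {"P01": 8.2, "P02": 6.9},
    "NEUROD1": {"P01": 5.1, "P02": 7.4},
}
# Hypothetical clinical records keyed by the same sample identifiers.
clinical = {
    "P01": {"os_months": 13.0, "os_event": 1},
    "P02": {"os_months": 9.5, "os_event": 0},
}

# One row per patient: gene columns followed by survival columns.
survival_ready = [{**row, **clinical[row["sample"]]} for row in to_column_based(row_based)]
for record in survival_ready:
    print(record)
```

Each resulting record holds the time, event indicator, and predictors together, which is the shape that survival modeling functions generally expect.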

Beyond simple harmonization of existing GEO records, SCAPeSCLC incorporates several derived analytical resources generated during dataset curation, including patient-level expression matrices, standardized survival endpoints, pathway enrichment profiles, Bayesian hierarchical pathway activity estimates, and survival modeling outputs. These additions were developed to reduce preprocessing burden and facilitate immediate use of the resource for exploratory and methodological analyses.

Users are encouraged to review the accompanying GitHub repository and archived Zenodo release, which provide example R scripts demonstrating standardized workflows for CPH modeling across gene expression and Bayesian posterior immune profile variables. These scripts may serve as templates for extending analyses to additional genes, clinical covariates, or alternative survival endpoints.

Future expansion of SCAPeSCLC may incorporate additional publicly available spatial transcriptomic datasets as suitable resources emerge. At the time of curation, publicly available SCLC spatial transcriptomic cohorts outside the CANTABRICO and IMfirst trials included GSE263196 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE263196, accessed on 3 June 2026) and GSE318867 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE318867, accessed on 3 June 2026) [29]. However, these datasets were constrained by limited sample size and lacked accompanying survival endpoints at the time of retrieval, restricting their utility for direct validation of the survival-oriented components of the present resource. Nevertheless, future integration of independent spatial datasets may facilitate external evaluation of pathway-level and gene-level observations generated using SCAPeSCLC.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/data11070152/s1, File S1: Gene Expression Median Percentiles; File S2: Patient-Level SCLC Subtype Cluster Assignments; File S3: CTA Genes Violating PH Assumptions in Confounder-adjusted CPH; Figure S1: Elbow method plot. The plot shows AIC, BIC and WSS against number of clusters. The lowest BIC value corresponds to four clusters; Figure S2: Cluster matrix plots. These plots show four clusters identified based on the expression patterns of ASCL1, NEUROD1, POU2F3 and YAP1; Figure S3: Cluster mean plot. The plot shows mean z-scored expression of ASCL1, NEUROD1, POU2F3 and YAP1 in each identified cluster; Figure S4: Cluster density plots. The plots show the density of the identified clusters per ASCL1, NEUROD1, POU2F3 and YAP1 based on z-scored expression of these genes; Table S1: Univariable linear regression analyses evaluating associations between treatment cohort and the first two principal components derived from patient-level log₂-transformed gene expression data; Table S2: Multivariable linear regression analyses evaluating clinical and treatment-related factors associated with the first two principal components derived from patient-level log₂-transformed gene expression data; Table S3: Comparative performance metrics of exploratory unsupervised clustering strategies used for subtype characterization of the combined ES-SCLC cohort; Table S4: Model-based clustering performance metrics.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available in the open access repository Zenodo with the following digital object identifier: https://doi.org/10.5281/zenodo.19897645.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CPH	Cox Proportional Hazards
CTA	Cancer Transcriptome Atlas
DSP	Digital Spatial Profiler
DSS	Disease-Specific Survival
ES-SCLC	Extensive-Stage Small Cell Lung Cancer
GEO	Gene Expression Omnibus
KM	Kaplan–Meier
OS	Overall Survival
PCA	Principal Component Analysis
PFS	Progression-Free Survival
ROI	Region of Interest
SCLC	Small Cell Lung Cancer
ToT	Time on Treatment
TTP	Time to Progression

References

1. Yan, J.S.; Wang, Y.Z.; Zhao, S.Y.; Na, F.J.; Cui, Z.X.; Cai, L.T.; Li, H.M.; Zhao, M.F. Immunotherapy for Extensive-Stage Small Cell Lung Cancer: Current Status, Challenges and Future Strategies. Cancer Lett. 2026, 645, 218382.
2. Durer, S.; Fu, P.; Chen, Z.; Dowlati, A. Roles of Restricted Mean Survival Time and Restricted Mean Time Lost in Evaluating Immune Checkpoint Inhibitor Efficacy for Extensive-Stage Small Cell Lung Cancer. Cancer Res. Commun. 2026, 6, 77–84.
3. Cheema, P.K.; Perdrizet, K.A.; Sangha, R.S.; Breadner, D.; Daaboul, N.; Farley, S.; Jao, K.; Liu, G.; Logan, B.; Melosky, B.; et al. Early Experience with Tarlatamab (T-Cell Engagers) for Extensive-Stage Small Cell Lung Cancer (ES-SCLC) in Canada: Lessons Learned and Implementation Strategies. Curr. Oncol. 2026, 33, 84.
4. Paz-Ares, L.G.; O’Byrne, K.J.; Johnson, M.L.; Reck, M.; Girard, N.; Hayashi, H.; Zhou, C.; Gharpure, V.S.; Pisupati, R.; Pacius, M.; et al. TIGOS Trial: A Randomized, Double-Blind, Phase III Trial of Atigotatug and Nivolumab Fixed-Dose Combination With Chemotherapy Versus Atezolizumab With Chemotherapy as First-Line Therapy in Patients With Extensive-Stage Small Cell Lung Cancer. Clin. Lung Cancer 2026, 27, 140–145.
5. Loaiza-Moss, J.; Leitges, M. Pan-Cancer Landscape of Protein Kinase D3: An Integrative TCGA Multi-Omics Analysis of Clinical, Molecular, and Immunological Roles. PLoS ONE 2026, 21, e0346173.
6. Liu, N.; Bhuva, D.D.; Mohamed, A.; Bokelund, M.; Kulasinghe, A.; Tan, C.W.; Davis, M.J. StandR: Spatial Transcriptomic Analysis for GeoMx DSP Data. Nucleic Acids Res. 2024, 52, e2.
7. Thomsen, C.; Truumees, B.; Nielsen, S.; Schnack, B.; Sara, N.; Newell, R.; Thorsen, K.; Røge, R. GeoMx and RNAscope: A Comparative Assessment of Their Utility for Spatial mRNA Expression Profiling in Formalin-Fixed Breast Cancer Tissue. Cytom. Part A 2026, 109, 147–154.
8. Peressini, M.; Garcia-Campelo, R.; Massuti, B.; Martí, C.; Cobo, M.; Gutiérrez, V.; Dómine, M.; Fuentes, J.; Majem, M.; de Castro, J.; et al. Spatially Preserved Multi-Region Transcriptomic Subtyping and Biomarkers of Chemoimmunotherapy Outcome in Extensive-Stage Small Cell Lung Cancer. Clin. Cancer Res. 2024, 30, 3036–3049.
9. Zhang, Z.; Sun, X.; Liu, Y.; Zhang, Y.; Yang, Z.; Dong, J.; Wang, N.; Ying, J.; Zhou, M.; Yang, L. Spatial Transcriptome-Wide Profiling of Small Cell Lung Cancer Reveals Intra-Tumoral Molecular and Subtype Heterogeneity. Adv. Sci. 2024, 11, 2402716.
10. del Rey-Vergara, R.; Galindo-Campos, M.A.; Rocha, P.; Carpes, M.; Martínez, C.; Masfarré, L.; Menéndez, S.; Quimis, F.; Rossell, A.; Iñañez, A.; et al. MET Pathway Inhibition Increases Chemo-Immunotherapy Efficacy in Small Cell Lung Cancer. Cell Rep. Med. 2025, 6, 102194.
11. Donati, B.; Manzotti, G.; Torricelli, F.; Ascione, C.; Valli, R.; Santandrea, G.; Ragazzi, M.; Zanetti, E.; Ciarrocchi, A.; Piana, S. Digital Spatial Profiling for Pathologists. Virchows Arch. 2025, 486, 971–981.
12. Merritt, C.R.; Ong, G.T.; Church, S.E.; Barker, K.; Danaher, P.; Geiss, G.; Hoang, M.; Jung, J.; Liang, Y.; McKay-Fleisch, J.; et al. Multiplex Digital Spatial Profiling of Proteins and RNA in Fixed Tissue. Nat. Biotechnol. 2020, 38, 586–599.
13. Yu, Z.; Zou, J.; Xu, F. The Molecular Subtypes of Small Cell Lung Cancer Defined by Key Transcription Factors and Their Clinical Significance. Lung Cancer 2024, 198, 108033.
14. Baine, M.K.; Hsieh, M.S.; Lai, W.V.; Egger, J.V.; Jungbluth, A.A.; Daneshbod, Y.; Beras, A.; Spencer, R.; Lopardo, J.; Bodd, F.; et al. SCLC Subtypes Defined by ASCL1, NEUROD1, POU2F3, and YAP1: A Comprehensive Immunohistochemical and Histopathologic Characterization. J. Thorac. Oncol. 2020, 15, 1823–1835.
15. Lo, Y.C.; Rivera-Concepcion, J.; Vasmatzis, G.; Aubry, M.C.; Leventakos, K. Subtype of SCLC Is an Intrinsic and Persistent Feature Through Systemic Treatment. JTO Clin. Res. Rep. 2023, 4, 100561.
16. Zhu, Y.; Li, S.; Wang, H.; Ren, W.; Chi, K.; Wu, J.; Mao, L.; Huang, X.; Zhuo, M.; Lin, D. Molecular Subtypes, Predictive Markers and Prognosis in Small-Cell Lung Carcinoma. J. Clin. Pathol. 2024, 78, 42–50.
17. Cheng, D.T.; Mitchell, T.N.; Zehir, A.; Shah, R.H.; Benayed, R.; Syed, A.; Chandramohan, R.; Liu, Z.Y.; Won, H.H.; Scott, S.N.; et al. Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT): A Hybridization Capture-Based next-Generation Sequencing Clinical Assay for Solid Tumor Molecular Oncology. J. Mol. Diagn. 2015, 17, 251–264.
18. Rudin, C.M.; Brambilla, E.; Faivre-Finn, C.; Sage, J. Small-Cell Lung Cancer. Nat. Rev. Dis. Prim. 2021, 7, 3.
19. Ireland, A.S.; Micinski, A.M.; Kastner, D.W.; Guo, B.; Wait, S.J.; Spainhower, K.B.; Conley, C.C.; Chen, O.S.; Guthrie, M.R.; Soltero, D.; et al. MYC Drives Temporal Evolution of Small Cell Lung Cancer Subtypes by Reprogramming Neuroendocrine Fate. Cancer Cell 2020, 38, 60–78.e12.
20. Cho, C.; Haddadi, N.S.; Kidacki, M.; Woodard, G.A.; Shakiba, S.; Yıldız-Altay, Ü.; Richmond, J.M.; Vesely, M.D. Spatial Transcriptomics in Inflammatory Skin Diseases Using GeoMx Digital Spatial Profiling: A Practical Guide for Applications in Dermatology. JID Innov. 2025, 5, 100317.
21. Hernandez, S.; Lazcano, R.; Serrano, A.; Powell, S.; Kostousov, L.; Mehta, J.; Khan, K.; Lu, W.; Solis, L.M. Challenges and Opportunities for Immunoprofiling Using a Spatial High-Plex Technology: The NanoString GeoMx® Digital Spatial Profiler. Front. Oncol. 2022, 12, 890410.
22. Liberzon, A.; Birger, C.; Thorvaldsdóttir, H.; Ghandi, M.; Mesirov, J.P.; Tamayo, P. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst. 2015, 1, 417–425.
23. Hukku, A.; Quick, C.; Luca, F.; Pique-Regi, R.; Wen, X. BAGSE: A Bayesian Hierarchical Model Approach for Gene Set Enrichment Analysis. Bioinformatics 2020, 36, 1689–1695.
24. Tang, Z.; Kang, B.; Li, C.; Chen, T.; Zhang, Z. GEPIA2: An Enhanced Web Server for Large-Scale Expression Profiling and Interactive Analysis. Nucleic Acids Res. 2019, 47, W556–W560.
25. Kang, Y.J.; Pan, L.; Liu, Y.; Rong, Z.; Liu, J.; Liu, F. GEPIA3: Enhanced Drug Sensitivity and Interaction Network Analysis for Cancer Research. Nucleic Acids Res. 2025, 53, W283–W290.
26. Lánczky, A.; Győrffy, B. Web-Based Survival Analysis Tool Tailored for Medical Research (KMplot): Development and Implementation. J. Med. Internet Res. 2021, 23, e27633.
27. Wang, Z.; Lachmann, A.; Ma’ayan, A. Mining Data and Metadata from the Gene Expression Omnibus. Biophys. Rev. 2019, 11, 103–110.
28. Jäger, N. Bioinformatics Workflows for Clinical Applications in Precision Oncology. Semin. Cancer Biol. 2022, 84, 103–112.
29. Campisi, M.; Osaki, T.; Dryg, I.; Stornante, C.; Wolff, J.; Weirather, J.; Weaver, N.; Tarannum, M.; Gillanders, I.; Bers, A.; et al. Vascular STING Activation Facilitates NK Cell Anti-Tumor Immunity in Small Cell Lung Cancer. Cancer Cell 2026, 44, 858–878.e16.

Figure 1. Principal component analysis (PCA) of patient-level log₂-transformed gene expression data across the combined ES-SCLC cohort. Samples are colored according to treatment cohort (A), sex (B), and bone metastasis status (C). Ellipses represent 95% confidence regions for each group.

Figure 2. Density histogram of global gene expression across the combined cohort (A), density histogram of gene expression distributions per patient (B), density histogram of gene expression distributions by treatment arm (C), and allocation of ROIs per patient (D). All gene expression values are presented as log₂-transformed values. Colors are used to distinguish individual patients and cohorts, and do not represent quantitative values.

Figure 3. Density distributions of ASCL1 (red), NEUROD1 (blue), POU2F3 (green), and YAP1 (purple) across the combined cohort (A), and box plots showing expression levels and median percentile ranks of ASCL1, NEUROD1, POU2F3, and YAP1 (B). All gene expression values are presented as log₂-transformed values.

Figure 4. Raincloud plots illustrating the expression of 14 SCLC-related commonly altered genes in the combined cohort of ES-SCLC patients. Gene expression values are presented as log₂-transformed values.

Figure 5. Mean expression bar plots of ASCL1 (red), NEUROD1 (blue), POU2F3 (green), and YAP1 (purple) with 95% confidence interval bars across the identified clusters (A), PCA biplot illustrating the distribution of clusters based on subtype marker expression (B), and patient number-annotated heatmap of ASCL1, NEUROD1, POU2F3, and YAP1 expression stratified by patient cluster assignment (C).

Figure 6. Kaplan–Meier–based visualization of time-to-event outcomes in the combined cohort, showing percentage risk over time for PFS, DSS, and OS (A), cumulative hazard functions for PFS, DSS, and OS (B), and survival probability curves for PFS, DSS, and OS with corresponding numbers at risk and cumulative event counts displayed below the time axis (C).

Figure 7. Forest plots of unadjusted HRs for OS derived from CPH models for a subset of CTA genes with CI width < 0.5. Individual p-values, Benjamini–Hochberg-adjusted false discovery rates and p-values associated with the Schoenfeld residuals test for PH assumption are shown on the right. Dots colored in red indicate p-values < 0.05.

Figure 8. Heatmap visualization of CTA biological pathway enrichment z-scores stratified by patient clusters and treatment arms, with BPs grouped into 7 primary categories and annotated with patient identifiers along the right axis. Positive and negative values indicate higher and lower-than-average pathway activity per patient, respectively. These posterior estimates provide pathway-level patient-to-patient variations that should be interpreted in parallel to patient-level mean pathway activity available in Datasheet D12 (P & D: physiology and disease).

Figure 9. Forest plot visualization of hazard ratios for selected CTA biological pathways using PFS (A) and DSS (B) endpoints, limited to pathways demonstrating relatively narrow confidence interval widths (<1.0). Individual p-values, BH-adjusted FDRs and p-values associated with the Schoenfeld residuals test for PH assumption are shown on the right. Dots colored in red indicate p-values < 0.05.

Table 1. Summary of source data series used for dataset curation.

Data Series	Platform	Sample (N)	Patient (N)	Cohort	Treatment	Ref.
GSE261348	GeoMx DSP	175	32	IMfirst	Atezolizumab	[8]
GSE261345	GeoMx DSP	121	26	CANTABRICO	Durvalumab	[8,10]

Table 2. Demographic information and clinical characteristics of patients in the combined cohort and treatment arms.

Variable			Combined Cohort (N = 58)	Atezolizumab Cohort (N = 32)	Durvalumab Cohort (N = 26)	p-Value
Age (yr)			64.5 ± 8.02	62.9 ± 8.57	66.3 ± 7.00	0.101 *
Sex		Female	19 (32.8%)	9 (28.1%)	10 (38.5%)	0.404 †
Sex		Male	39 (67.2%)	23 (71.9%)	16 (61.5%)	0.404 †
Smoking Status		Current	28 (48.3%)	13 (40.6%)	15 (57.7%)	0.196 †
Smoking Status		Former	30 (51.7%)	19 (59.4%)	11 (42.3%)	0.196 †
Platinum Agent		Carboplatin	50 (86.2%)	27 (84.4%)	23 (88.5%)	0.720 ‡
Platinum Agent		Cisplatin	8 (13.8%)	5 (15.6%)	3 (11.5%)	0.720 ‡
ECOG Performance		0	16 (27.6%)	10 (31.25%)	6 (23.1%)	0.833 ‡
ECOG Performance		1	38 (65.5%)	20 (62.50%)	18 (69.2%)	0.833 ‡
ECOG Performance		2	4 (6.9%)	2 (6.25%)	2 (7.7%)	0.833 ‡
Metastasis	CNS	Yes	8 (13.8%)	6 (18.75%)	2 (7.7%)	0.278 ‡
Metastasis	CNS	No	50 (86.2%)	26 (81.25%)	24 (92.3%)	0.278 ‡
Metastasis	Liver	Yes	20 (34.5%)	11 (34.4%)	9 (34.6%)	0.985 †
Metastasis	Liver	No	38 (65.5%)	21 (65.6%)	17 (65.4%)	0.985 †
Metastasis	Bone	Yes	16 (27.6%)	4 (12.5%)	12 (46.15%)	0.007 ‡
Metastasis	Bone	No	42 (72.4%)	28 (87.5%)	14 (53.85%)	0.007 ‡
Metastasis	Overall	Yes	36 (62.1%)	17 (53.1%)	19 (73.1%)	0.119 †
Metastasis	Overall	No	22 (37.9%)	15 (46.9%)	7 (26.9%)	0.119 †

ECOG: Eastern Cooperative Oncology Group; CNS: central nervous system. * Calculated using Welch’s t-test. † Calculated using the chi-squared test of association. ‡ Calculated using Fisher’s exact test of association.

Table 3. Summary of outcome events and time-to-event characteristics for survival endpoints in the combined cohort and treatment arms.

Cohort	Statistics	PFS	DSS	OS
Atezolizumab (N = 32)	Events (N)	25	21	23
Atezolizumab (N = 32)	Censored (N)	7	11	9
Atezolizumab (N = 32)	Median (months)	6.90	13.0	13.0
Atezolizumab (N = 32)	95% CI (months)	6.95–13.2	11.8–18.3	11.8–18.3
Durvalumab (N = 26)	Events (N)	23	19	21
Durvalumab (N = 26)	Censored (N)	3	7	5
Durvalumab (N = 26)	Median (months)	6.29	10.1	10.1
Durvalumab (N = 26)	95% CI (months)	5.79–11.9	9.22–16.4	11.8–18.3
Combined (N = 58)	Events (N)	48	40	44
Combined (N = 58)	Censored (N)	10	18	14
Combined (N = 58)	Median (months)	6.75	11.8	11.8
Combined (N = 58)	95% CI (months)	7.38–11.7	11.7–16.4	11.7–16.4

DSS: disease-specific survival; OS: overall survival; PFS: progression-free survival.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

