1. Introduction
Gastric cancer (GC) continues to pose a major challenge to global health, ranking as the fifth most frequently diagnosed malignancy and the fourth leading cause of cancer-related death worldwide [1]. Prognosis remains strongly dependent on the stage at diagnosis, with 5-year survival rates exceeding 70% for localized disease but dropping below 10% for metastatic GC [2].
Against this background, artificial intelligence (AI) is being explored as a tool to support screening and diagnosis across a broad spectrum of surgical diseases, including benign and malignant gastrointestinal disorders. AI-based algorithms are being incorporated into screening pathways and perioperative decision-making in surgical oncology to improve detection efficiency and diagnostic accuracy [3]. These developments illustrate the potential of AI-driven predictive models to facilitate earlier detection of GC and to optimize the use of resource-intensive investigations such as upper endoscopy in high-risk populations.
Harnessing routinely collected electronic health record (EHR) data for earlier GC identification could improve outcomes by enabling timely endoscopy and treatment. Structured EHR signals, such as demographics, diagnoses, and laboratory data, represent an attractive source for data-driven cancer screening [4,5]. However, EHR-based model development faces persistent challenges, including incomplete or missing records, irregular sampling intervals, and the overall complexity of clinical documentation. Moreover, ethical and regulatory requirements, such as institutional approvals or patient consent, can further limit the volume of analyzable data, leading to relatively small, heterogeneous datasets for model training.
For gastric cancer prediction, machine learning models such as logistic regression and XGBoost, using routine structured variables (e.g., demographics, diagnoses, and common laboratory tests), have been investigated [5,6,7]. Despite these advances, most prior studies have focused on within-cancer prediction and have not examined whether cross-cancer regularities in structured EHR signals (e.g., anemia, inflammation, and metabolic axes) can be distilled and transferred to improve GC modeling under data scarcity.
Across gastrointestinal (GI) and hepatopancreatobiliary (HPB) malignancies, several clinical and laboratory signatures that are visible in structured EHRs recur across tumor types and are directly relevant for GC. Luminal GI tumors, particularly colorectal and gastric cancers, often present with chronic occult bleeding and iron-deficiency anemia (IDA); population-based studies using primary care or administrative data show that anemia and low hemoglobin are strong early signals for colorectal cancer, and clinical practice updates emphasize that otherwise unexplained IDA in adults should prompt evaluation of the upper and lower GI tract for malignancy [8,9,10,11]. In chronic liver disease and primary liver cancer, IDA frequently co-occurs with systemic inflammation and malnutrition, and biomarkers such as albumin, C-reactive protein, neutrophil–lymphocyte ratio, and platelet–lymphocyte ratio are prognostic in both liver cancer and GC [12,13,14,15]. Metabolic syndrome and dysglycemia form another shared axis: large cohort and meta-analytic studies link diabetes and metabolic syndrome not only to colorectal and pancreatic cancer risk but also to increased GC incidence [16,17,18,19,20]. Taken together, these data suggest that routinely collected features such as complete blood count indices, basic chemistries, and International Classification of Diseases (ICD) codes for IDA, liver disease, diabetes, and other metabolic comorbidities encode biologically meaningful cross-cancer information. A multi-cancer pretraining stage can therefore exploit this shared structure to learn representations that are transferable to GC.
Deep learning (DL) provides a principled way to learn non-linear interactions from tabular EHR variables and to distill noisy laboratory measurements into task-relevant representations. When labels for GC are limited, transfer learning and multi-task learning can mitigate data scarcity by first inducing shared, cancer-agnostic representations on related cancers and then adapting them to GC [21,22,23]. In this paradigm, DL primarily functions as a representation learner that captures cross-cancer laboratory signatures expected to recur in GC, thereby improving sample efficiency and stability during adaptation.
Recent work has increasingly explored self-supervised pretraining and EHR foundation models, which learn general-purpose patient representations using contrastive or masked-feature objectives before task-specific fine-tuning [24,25,26]. Our work is complementary to these efforts: instead of learning a general-purpose foundation model from very large unlabeled corpora, we study a supervised transfer setting in which labeled non-GC cancer cohorts are used to pre-train a neural encoder and then adapt it to GC. This design examines how effectively structured variables and cross-cancer label supervision can be leveraged for GC representation learning, while remaining compatible with future integration of self-supervised objectives.
We focus our initial modeling on routinely collected structured EHR variables (demographics, ICD-derived comorbidities, and routine laboratory tests) since these variables are ubiquitously available across care settings, standardized, and low-burden to obtain, which makes models straightforward to deploy in real-world workflows [4,5]. We therefore treat structured EHRs as a deployment-friendly starting point; integrating imaging and notes is an important direction for future work once the structured-only baseline is firmly established.
Accordingly, we aimed to investigate a cross-cancer transfer-learning paradigm with deep learning based on EHR data: we pre-train a shared multilayer-perceptron (MLP) backbone on non-gastric GI/HPB cancers using routinely collected EHR variables (demographics, diagnoses/ICD codes, and laboratory values) and then adapt it to GC via fine-tuning. We evaluate this hypothesis retrospectively in the MIMIC-IV v3.1 EHR database [27], restricting inputs to structured EHR data available without using diagnostic endoscopy or pathology.
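To make the setup concrete, the following is a minimal, illustrative PyTorch sketch of the pretrain-then-fine-tune procedure described above; the layer sizes, training loop, and synthetic tensors are placeholders rather than the study's actual implementation.

```python
import torch
import torch.nn as nn

class EHREncoder(nn.Module):
    """Shared MLP backbone over structured EHR features (illustrative sizes)."""
    def __init__(self, n_features: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

def train(model, X, y, epochs=20, lr=1e-3):
    # Binary cross-entropy on a single logit; class weighting could be added via pos_weight.
    loss_fn = nn.BCEWithLogitsLoss()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return model

n_features = 64  # illustrative feature count

# Step 1: pre-train encoder + head on the pooled non-gastric cancer task
# (synthetic tensors stand in for the source-cohort feature matrix and labels).
source_model = train(
    nn.Sequential(EHREncoder(n_features), nn.Linear(128, 1)),
    torch.randn(512, n_features), torch.randint(0, 2, (512,)).float(),
)

# Step 2: reuse the pre-trained encoder, attach a fresh GC head, and fine-tune on GC labels.
gc_model = train(
    nn.Sequential(source_model[0], nn.Linear(128, 1)),
    torch.randn(128, n_features), torch.randint(0, 2, (128,)).float(),
)
```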
From a clinical perspective, the proposed framework is especially relevant for health systems that see relatively few GC cases but routinely manage other gastrointestinal and hepatopancreatobiliary malignancies. In such settings, our study shows how existing structured EHR data and non-GC cancer cohorts can be reused to build a risk model for GC under label scarcity without requiring new data collection. Beyond GC, the same cross-cancer transfer framework can be directly adapted to other underrepresented cancers or rare tumor subtypes where labeled data are limited but related malignancies are more prevalent.
From a methodological standpoint, our framework is deliberately aligned with established multi-task and transfer-learning formulations [21,22,23]. Classical multi-task learning typically trains a single model jointly on multiple tasks and reports performance averaged across those tasks. In contrast, we treat colorectal, esophageal, liver, and pancreatic cancers as auxiliary source tasks that are used only during pretraining, and we evaluate performance exclusively on a single target task (gastric cancer) under varying degrees of label scarcity. Our primary goal is to ask an applied question: to what extent can cross-cancer supervision on routine laboratory and comorbidity profiles improve GC risk modeling under label scarcity when using only structured EHR variables? Accordingly, the contribution of this work is mainly empirical and translational: we provide a systematic assessment of cross-cancer transfer on structured EHRs for GC, including (i) label-fraction sweeps that emulate low-resource GC settings, (ii) ablations on pretraining source-cancer composition, and (iii) comparisons between freezing versus full fine-tuning of the shared encoder.
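As an illustration of the label-fraction protocol in (i), the sketch below emulates GC label scarcity by stratified subsampling of the training split over repeated seeds and scores each run on a fixed held-out test set; a scikit-learn logistic regression stands in for the actual Transfer and scratch-MLP models, and all data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))                          # synthetic stand-in features
y = (X[:, 0] + rng.normal(size=2000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

for frac in (1.0, 0.5, 0.25, 0.1):                       # emulated GC label scarcity
    aurocs, aps = [], []
    for seed in range(5):                                # repeated subsampling runs
        if frac < 1.0:
            X_sub, _, y_sub, _ = train_test_split(
                X_tr, y_tr, train_size=frac, stratify=y_tr, random_state=seed)
        else:
            X_sub, y_sub = X_tr, y_tr
        model = LogisticRegression(max_iter=1000).fit(X_sub, y_sub)
        p = model.predict_proba(X_te)[:, 1]
        aurocs.append(roc_auc_score(y_te, p))
        aps.append(average_precision_score(y_te, p))
    print(f"frac={frac:.2f}  AUROC={np.mean(aurocs):.3f}  AP={np.mean(aps):.3f}")
```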
4. Discussion
In this retrospective study using de-identified structured EHR data from MIMIC-IV v3.1 [27], a transfer learning strategy that pre-trains a multilayer perceptron on non-gastric cancers (CC/EC/LC/PC) and then adapts to GC achieved the best overall performance among strong non-transfer baselines. On the full GC label regime, Transfer attained the highest AUROC and F1 (Table 2), while AP was essentially tied with the scratch MLP. When we progressively reduced GC training labels, Transfer preserved a consistent AUROC advantage and typically higher AP, with the largest margins over the scratch MLP under the scarcest labels (Figure 2). We also observed that full fine-tuning generally outperformed frozen-backbone adaptation once modest GC supervision was available, while freezing remained competitive at the most extreme data scarcity (Figure 3). Finally, varying the composition of pretraining sources showed that combining four sources (CC + EC + LC + PC) produced the best overall performance, raising AUROC and F1 over the scratch MLP and yielding an AP essentially on par with it (Table 3).
Our findings align with prior observations that risk signals visible in structured EHR data recur across GI/HPB malignancies and therefore can act as transferable supervision for GC [4,9,10,12,13,14,18,19,20,35]. While earlier GC prediction efforts have largely optimized within-cancer models using routine variables [5,6,7], our results suggest that cross-cancer pretraining helps distill cancer-agnostic representations from noisy labs and comorbidities, which then transfer to GC with improved sample efficiency. This is consistent with established benefits of multi-task and transfer learning on related tasks [21,22,23].
Our findings are also consistent with prior reports that transfer learning improves oncologic prediction under label scarcity across various data modalities. In medical imaging, models pre-trained on large source datasets and then fine-tuned on smaller tumor-specific cohorts routinely yield higher discrimination and better generalization [36]. In omics, cross-cancer transfer learning improves survival or progression prediction [37,38]. Our cross-cancer transfer on structured EHR variables parallels these patterns and suggests that cancer-agnostic signals learned from routine labs and comorbidity profiles can be reused to stabilize GC modeling when labels are limited.
We note that the absolute improvement in AUROC when comparing Transfer with the scratch MLP is modest (Table 2). As detailed in the Results section, these gains are accompanied by improvements in F1 (0.636 vs. 0.609) and specificity (0.786 vs. 0.739), while sensitivity remains broadly similar (0.768 vs. 0.782). Despite the moderate size of the held-out GC test set, the observed differences are consistent across repeated train–test splits and bootstrap-based uncertainty estimates. Thus, we do not claim a dramatic effect size at the population level, but rather a small and reproducible shift in the operating characteristics of the model. The main value of the proposed framework lies in its stability and behavior under label scarcity. As GC training labels are reduced, the relative advantage of Transfer becomes more pronounced in AUROC, AP, and F1 (Figure 2), suggesting that pretraining on related cancers yields a more stable operating profile when data are limited. In line with this, DeLong's test showed statistically significant AUROC improvements of the Transfer model over LR and XGB, while the comparison versus the scratch MLP did not reach conventional significance, which is compatible with a small effect size and limited statistical power given the moderate GC test cohort size.
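For readers who wish to reproduce this kind of uncertainty assessment, the following is a hedged sketch of a paired bootstrap over the held-out test set for the AUROC difference between two models; it is a generic illustration with synthetic scores, not the exact DeLong procedure used in the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_diff(y_true, p_a, p_b, n_boot=2000, seed=0):
    """Paired bootstrap CI for AUROC(model A) - AUROC(model B) on the same test set."""
    rng = np.random.default_rng(seed)
    diffs, n = [], len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # resample test cases with replacement
        if len(np.unique(y_true[idx])) < 2:      # skip resamples with a single class
            continue
        diffs.append(roc_auc_score(y_true[idx], p_a[idx]) -
                     roc_auc_score(y_true[idx], p_b[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(np.mean(diffs)), (float(lo), float(hi))

# Example with synthetic scores (not study data):
y = np.r_[np.ones(50), np.zeros(150)].astype(int)
rng = np.random.default_rng(1)
p_transfer = np.clip(y * 0.3 + rng.uniform(size=200) * 0.7, 0, 1)
p_scratch = rng.uniform(size=200)
print(bootstrap_auroc_diff(y, p_transfer, p_scratch))
```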
From a clinical standpoint, even such modest but consistent gains in discrimination and precision may still be relevant when predictions are used to prioritize patients for resource-intensive investigations such as upper endoscopy. The Transfer model achieves similar sensitivity but higher specificity and F1 than the scratch MLP, implying that, for a fixed endoscopy capacity, a slightly larger fraction of procedures would be directed toward patients who truly have GC and fewer toward those at very low risk. In high-volume settings where only a minority of inpatients can be referred for endoscopy, small improvements in the concentration of true GC cases within the high-risk stratum could translate into earlier diagnosis or fewer unnecessary procedures.
The operating points reported in this study (e.g., the thresholds used to summarize F1, sensitivity, and specificity) were selected on the validation set to provide a single, reproducible basis for internal model comparison. They should therefore be interpreted as illustrative examples of model behavior under specific thresholds, rather than as recommended clinical cutoffs. In practice, any deployment would require choosing a risk threshold in close collaboration with clinicians and service planners, taking into account local GC prevalence, endoscopy capacity, and the relative harms of false negatives and false positives. A pragmatic approach would be to first specify an acceptable lower bound on sensitivity (for example, requiring that at least 90% of GC cases over the prediction horizon are flagged as "high risk"). For each candidate threshold that meets this condition, one can then calculate simple quantities such as the expected number of endoscopies required per additional cancer detected [39]. These summaries can be compared across thresholds, taking into account local resources and the expected case mix, to select a clinically acceptable operating point. In centers with very limited endoscopy availability, a higher threshold might be chosen to prioritize specificity and limit the number of referrals, whereas screening programs or high-risk clinics may deliberately adopt a lower threshold to minimize missed cancers even at the cost of more false-positive procedures.
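The threshold-selection heuristic described above can be summarized in a few lines of code. The sketch below (hypothetical, with synthetic risk scores) keeps only thresholds whose sensitivity meets a pre-specified floor and reports the implied number of endoscopies per cancer detected at each remaining threshold.

```python
import numpy as np

def threshold_table(y_true, risk, sens_floor=0.90):
    """List candidate thresholds meeting a sensitivity floor with workload summaries."""
    rows = []
    for t in np.unique(np.round(risk, 3)):
        flagged = risk >= t
        tp = int(np.sum(flagged & (y_true == 1)))
        sens = tp / max(int(np.sum(y_true == 1)), 1)
        if sens < sens_floor or tp == 0:
            continue
        rows.append({
            "threshold": float(t),
            "sensitivity": float(sens),
            "referrals": int(np.sum(flagged)),                      # endoscopies triggered
            "endoscopies_per_cancer": float(np.sum(flagged) / tp),
        })
    return rows

# Example with synthetic risks (5% prevalence):
rng = np.random.default_rng(0)
y = (rng.uniform(size=1000) < 0.05).astype(int)
risk = np.clip(0.4 * y + rng.uniform(size=1000) * 0.6, 0, 1)
for row in threshold_table(y, risk)[:5]:
    print(row)
```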
We did not perform a formal decision-curve or clinical-impact analysis in this work, and our conclusions are therefore limited to conventional discrimination and classification metrics. Future studies should quantify the net clinical benefit of using the proposed transfer-learning model to guide endoscopy referrals, for example by comparing the number of additional GC cases detected and the number of endoscopies avoided per 1000 patients across a range of clinically plausible risk thresholds.
From a calibration standpoint, our reliability and ECE analyses indicate that all four models achieve broadly acceptable calibration for GC risk estimation, with the Transfer model showing slightly more over-confident predictions than the logistic regression baseline. Such over-confidence could in principle be addressed by post hoc recalibration on an external or prospective local cohort while holding the original model parameters fixed. Concretely, one could fit a calibration model that maps the Transfer model's risk score or its logit to a recalibrated probability, for example by updating only the intercept to match the overall GC incidence (recalibration-in-the-large) or by fitting a monotone non-parametric model such as isotonic regression [40,41]. In this setup, the Transfer model continues to provide the ranking of patients from lower to higher GC risk, and the separate calibration model aligns the predicted probabilities with the observed GC incidence in the target population.
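As an illustration, the following sketch applies the two recalibration strategies mentioned above to a synthetic local cohort while leaving the underlying risk model untouched: a constant shift of the logit so that the mean predicted risk matches the observed incidence (recalibration-in-the-large) and isotonic regression on the risk score.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit, logit
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
y_local = (rng.uniform(size=500) < 0.08).astype(int)                  # local GC outcomes (synthetic)
p_model = np.clip(0.3 * y_local + rng.uniform(size=500) * 0.7, 1e-4, 1 - 1e-4)  # fixed model's risks

# (a) Recalibration-in-the-large: shift the logit by a constant so the mean
#     recalibrated risk equals the observed local incidence.
delta = brentq(lambda d: expit(logit(p_model) + d).mean() - y_local.mean(), -10, 10)
p_intercept = expit(logit(p_model) + delta)

# (b) Isotonic regression: monotone non-parametric map from risk score to probability.
iso = IsotonicRegression(out_of_bounds="clip").fit(p_model, y_local)
p_isotonic = iso.predict(p_model)

print(f"raw mean risk {p_model.mean():.3f}, incidence {y_local.mean():.3f}, "
      f"after intercept update {p_intercept.mean():.3f}")
```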
We hypothesize three complementary mechanisms behind the observed gains: (i) representation sharing, whereby pretraining encourages the backbone to encode recurring EHR patterns that GC also exhibits; (ii) regularization under scarcity, whereby pretraining reduces the effective hypothesis space, aiding generalization when GC labels are limited; and (iii) optimization stability, whereby a pretrained initialization improves convergence to better optima compared with training from scratch. The superiority of full fine-tuning over freezing at moderate data availability suggests that some GC-specific re-alignment of shared features is beneficial, whereas freezing can mitigate variance when labels are extremely scarce, producing similar discrimination with fewer trainable degrees of freedom. These observations match classic transfer learning trade-offs between bias and variance [22,23]. Consistent with this interpretation, the repetition-level confidence intervals in Figure 2 remain relatively tight and nearly invariant across GC sampling rates for the Transfer model, whereas most baselines exhibit visibly larger dispersion at lower rates. This pattern supports the view that pretraining acts as an implicit regularizer that dampens sensitivity to sampling noise in small GC cohorts.
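The frozen versus full fine-tuning variants discussed here differ only in which parameters are exposed to the optimizer during GC adaptation; a minimal PyTorch sketch (with a stand-in encoder, not the study's trained backbone) is shown below.

```python
import torch
import torch.nn as nn

def build_adapted_model(encoder: nn.Module, hidden: int = 128, freeze: bool = False) -> nn.Module:
    if freeze:
        for p in encoder.parameters():
            p.requires_grad = False                   # frozen backbone: only the head is trained
    return nn.Sequential(encoder, nn.Linear(hidden, 1))  # fresh GC-specific head

# Stand-in pre-trained backbones (illustrative sizes).
encoder_a = nn.Sequential(nn.Linear(64, 128), nn.ReLU())
encoder_b = nn.Sequential(nn.Linear(64, 128), nn.ReLU())

# Frozen-backbone adaptation: only head parameters reach the optimizer.
frozen = build_adapted_model(encoder_a, freeze=True)
opt_frozen = torch.optim.Adam([p for p in frozen.parameters() if p.requires_grad], lr=1e-3)

# Full fine-tuning: all parameters (encoder + head) are updated on GC data.
full = build_adapted_model(encoder_b, freeze=False)
opt_full = torch.optim.Adam(full.parameters(), lr=1e-3)
```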
Compared with single-source pretraining (CC, EC, LC, or PC alone), combining all four sources delivered the strongest overall balance, improving AUROC and F1 versus scratch and yielding an AP essentially on par with it (Table 3). These results imply that (i) each source cancer contributes overlapping but not identical EHR-visible signals; and (ii) diversity across sources can add incremental value. We do not claim that our specific mix is universally optimal; rather, our results motivate future work on principled source selection and weighting schemes that account for relatedness, cohort size, and label quality.
To further contextualize model behavior, we inspected global feature importance for the main Transfer model using mean absolute SHAP values (DeepLIFT-based) on the held-out GC test set (Figure 4). Rather than being driven by a single variable, the model primarily relies on several clinically meaningful axes: (i) anemia and nutritional status, reflected by hemoglobin, red cell distribution width, and protein–calorie malnutrition; (ii) systemic inflammation and immune activation, captured by white blood cell count and monocyte percentage; (iii) metabolic and acid–base derangements, represented by anion gap and bicarbonate; (iv) cardiometabolic and oncologic comorbidities and exposures, including hypertension, smoking, family history of cancer, coronary atherosclerosis, gastroesophageal reflux disease, and adverse effects of antineoplastic agents; and (v) baseline demographic risk (age and sex). These groupings are consistent with established clinical links between iron-deficiency anemia, malnutrition, chronic inflammation, cardiometabolic burden, familial predisposition, and gastrointestinal malignancy, supporting the view that the Transfer model relies on clinically plausible EHR signals rather than spurious artifacts.
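For completeness, the following sketch shows how a DeepLIFT-style global importance ranking of this kind can be computed with the shap library for a small PyTorch model; the model, background sample, and feature count are illustrative stand-ins, and the exact output layout of shap_values varies across shap versions.

```python
import numpy as np
import shap
import torch
import torch.nn as nn

# Illustrative stand-ins so the sketch runs end to end (not the study's model or data).
gc_model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
X_background = torch.randn(100, 10)   # reference sample used by the explainer
X_test = torch.randn(50, 10)          # held-out test features

explainer = shap.DeepExplainer(gc_model, X_background)   # DeepLIFT-style attributions
sv = np.asarray(explainer.shap_values(X_test))           # shape may include an output axis
mean_abs_shap = np.abs(sv).reshape(-1, X_test.shape[1]).mean(axis=0)

# Rank features by global importance (indices of the most influential inputs first).
print(np.argsort(mean_abs_shap)[::-1])
```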
Taken together, these observations indicate that our study should be viewed primarily as an applied instantiation of multi-task and transfer-learning principles in a gastric cancer setting, rather than as a proposal of a fundamentally new algorithm. The backbone–head architecture, class-weighted loss, and fine-tuning strategy are intentionally simple and follow standard practice in deep learning for tabular EHR data [21,22,23]. The added value lies in the systematic evaluation of cross-cancer pretraining, label scarcity regimes, frozen versus full fine-tuning variants, and source-composition effects using only deployment-ready structured EHR variables.
From a clinical standpoint, the proposed framework is not meant to replace diagnostic endoscopy, but to support triage and prioritization in routine workflows. Because it relies only on structured EHR variables that are already collected in most hospitals, it could be implemented as a background risk score to flag patients who may benefit from earlier endoscopic evaluation or closer follow-up, particularly in systems where endoscopy capacity or specialist access is limited. Earlier identification of higher-risk patients in routine care could shorten time-to-endoscopy and facilitate detection at more treatable stages, where survival is markedly better [2]. Methodologically, the cross-cancer transfer design illustrates how related malignancies can serve as source tasks to bootstrap prediction models for cancers with few labels, a scenario that arises frequently in oncology (e.g., rare tumors, early-stage subgroups, or low-incidence settings). The transfer framework may therefore be appealing for institutions with few confirmed GC cases but access to other GI/HPB cancer cohorts; in principle, pretraining on related cancers could borrow strength and stabilize GC performance, although this remains to be confirmed in external settings.
From a deployment perspective, the proposed framework is relatively lightweight. All predictors are derived from routinely collected structured EHR data (demographics, comorbid diagnoses, medications, and basic laboratory tests) that are already available in most hospital information systems. Feature construction relies on simple summary statistics over a pre-specified look-back window and does not require free-text processing or medical imaging pipelines. The final GC risk score is produced by a shallow feed-forward neural network trained on tabular inputs, which can be evaluated efficiently on standard CPU hardware and is amenable to batch scoring of large inpatient cohorts in routine EHR jobs. Although we did not perform a formal benchmarking study of computational cost, the small model size and reliance on common laboratory panels suggest that runtime is unlikely to be the main barrier to real-world use. Instead, the more critical challenges for deployment are data governance, integration into local EHR workflows, prospective calibration monitoring, and external validation in diverse care settings. In addition, consistent with recent guidance on clinical risk prediction tools, safe deployment will require careful assessment of model calibration and subgroup fairness (e.g., across age, sex, and race/ethnicity), beyond the aggregate results reported here.
This study has several limitations.
First, all experiments were conducted using MIMIC-IV, a de-identified, single-center EHR database derived largely from inpatient encounters at a U.S. tertiary academic medical center. As a result, our models were trained and evaluated in a relatively homogeneous institutional context, with shared clinical workflows, laboratory assays, coding practices, and endoscopy referral thresholds. Patient demographics, case mix, and access to care may differ substantially in outpatient settings, community hospitals, or non-U.S. health systems, leading to distributional shifts in both predictors and outcome prevalence. Therefore, the absolute risk estimates and performance metrics reported here should not be assumed to generalize to other institutions or outpatient populations without further evaluation. Real-world deployment will require rigorous external validation and, if necessary, recalibration on independent multi-center and outpatient EHR cohorts to assess robustness under domain shift and to quantify clinical impact. In a future multicenter or international study, such validation would allow us to directly quantify how well the shared encoder and transfer-learning framework transport to hospitals with different patient populations, coding systems, and laboratory workflows, including higher-incidence regions such as East Asia. If substantial performance or calibration drift is observed, the model could be adapted in a stepwise fashion, for example by recalibrating the output layer with a modest number of local cases or by fine-tuning the encoder on pooled data from participating sites. These findings would in turn guide practical integration into hospital EHR systems by clarifying whether a single global model with site-specific decision thresholds is sufficient, or whether locally adapted models are needed to support safe deployment.
Second, cases were indexed by the first GC diagnosis code, and for each CBC/CMP laboratory we summarized the oldest available measurement within the 730-day pre-index window for cases (and across the full available record for controls). This policy was intended to reduce peri-diagnostic bias, whereby laboratory values obtained during diagnostic work-up or acute decompensation near the time of diagnosis might dominate the signal, and to emulate a more screening-like setting in which risk scores are computed at routine encounters earlier in the disease course. However, by collapsing irregular longitudinal trajectories into single-time-point snapshots, this representation cannot exploit trends, volatility, or recovery patterns that often carry additional predictive information and may attenuate the discriminative capacity of the model compared with fully longitudinal approaches. Moreover, because at least one measurement of each laboratory is required, this strategy likely enriches the cohort for patients with more complete laboratory histories and higher health-care utilization, which may bias performance estimates and limit generalizability to populations with sparser testing.
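To clarify this summarization policy, the sketch below (with hypothetical column names and toy data) keeps, for each patient and laboratory test, the earliest measurement recorded within the 730-day window before the index date and pivots the result into one row per patient.

```python
import pandas as pd

# Toy long-format laboratory table and per-patient index dates (hypothetical names).
labs = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "lab_name": ["hemoglobin", "hemoglobin", "creatinine", "hemoglobin"],
    "charttime": pd.to_datetime(["2019-01-10", "2019-06-01", "2019-02-01", "2020-03-15"]),
    "value": [11.2, 10.1, 0.9, 13.5],
})
index_dates = pd.Series(pd.to_datetime(["2019-12-31", "2020-06-30"]),
                        index=[1, 2], name="index_date")

df = labs.join(index_dates, on="subject_id")
in_window = (df["charttime"] <= df["index_date"]) & \
            (df["charttime"] >= df["index_date"] - pd.Timedelta(days=730))

# Oldest (earliest) measurement per patient and lab within the look-back window.
oldest = (df[in_window]
          .sort_values("charttime")
          .groupby(["subject_id", "lab_name"], as_index=False)
          .first())
features = oldest.pivot(index="subject_id", columns="lab_name", values="value")
print(features)
```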
Third, we restricted all analyses to a “lab-complete” cohort, requiring non-missing values for all predefined laboratory predictors within the look-back window. This design choice simplifies model development and avoids introducing additional uncertainty from ad hoc imputation; however, it also induces selection bias. In routine care, comprehensive laboratory panels are more likely to be ordered for patients with greater healthcare utilization and higher comorbidity burden, whereas individuals with fewer encounters or milder symptoms are less likely to have all tests measured. As a result, the case mix, event rate, and distribution of predictors in our training data may differ from those in a broader outpatient or low-utilization population. In our setting, this selection mechanism is likely to introduce an optimistic bias: the lab-complete cohort is enriched for patients with denser testing and more informative predictors, so the discrimination metrics (e.g., AUROC) and positive predictive value reported here may overestimate the performance that would be observed in real-world populations where laboratory testing is sparser and missingness is more prevalent. In such settings, both AUROC and, in particular, precision/recall-based measures may be lower, and model calibration may deteriorate if the model is applied without recalibration. Future work should relax the lab-complete requirement by incorporating principled strategies for handling missing data (e.g., multiple imputation or model-based imputation), so that patients with partial laboratory profiles can also contribute to model development.
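One concrete way to relax the lab-complete requirement, as suggested above, is model-based imputation; the sketch below uses scikit-learn's IterativeImputer on synthetic data purely as an illustration of the approach.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required by sklearn)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                  # stand-in laboratory features
mask = rng.uniform(size=X.shape) < 0.2         # ~20% values missing at random
X_missing = np.where(mask, np.nan, X)

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)   # in practice, fit on training data only
print(int(np.isnan(X_imputed).sum()))          # 0: all gaps filled
```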
In addition, because the held-out GC test set is of moderate size with a limited number of positive cases, the bootstrap confidence intervals are relatively wide and partially overlapping across models, which reflects finite-sample uncertainty. Larger external cohorts will be required to narrow these intervals and to more precisely quantify the performance differences between models.
Finally, we did not evaluate subgroup fairness (e.g., performance stratified by age, sex, or race/ethnicity), and our calibration assessment was restricted to aggregate reliability diagrams and ECE on the GC test set. For a clinical risk prediction tool, however, well-calibrated risk estimates and equitable performance across key patient subgroups are essential for safe deployment; accordingly, detailed subgroup- and setting-specific calibration and fairness analyses, ideally using external multi-center cohorts, will be required before any clinical use.
Several extensions appear promising. (i) Move beyond static features to longitudinal encoders that incorporate trends, rates of change, and testing frequency. (ii) Combine supervised multi-task pretraining with self-supervised objectives on large unlabeled EHR (e.g., masked feature modeling) to further improve sample efficiency. (iii) Explore domain adaptation and source reweighting to mitigate negative transfer when source–target mismatch is large. (iv) Expand external validation across health systems.