Analyzing the Impact of Oncological Data at Different Time Points and Tumor Biomarkers on Artificial Intelligence Predictions for Five-Year Survival in Esophageal Cancer

AIM: In this study, we use Artificial Intelligence (AI), including Machine Learning (ML) and Deep Learning (DL), to predict the long-term survival of resectable esophageal cancer (EC) patients in a high-volume surgical center. Our objective is to evaluate the predictive efficacy of AI methods for survival prognosis across different time points of oncological treatment. This involves comparing models trained with clinical data, integrating either Tumor, Node, Metastasis (TNM) classification or tumor biomarker analysis, for long-term survival predictions. METHODS: In this retrospective study, 1002 patients diagnosed with EC between 1996 and 2021 were analyzed. The original dataset comprised 55 pre- and postoperative patient characteristics and 55 immunohistochemically evaluated biomarkers following surgical intervention. To predict the five-year survival status, four AI methods (Random Forest RF, XG Boost XG, Artificial Neural Network ANN, TabNet TN) and Logistic Regression (LR) were employed. The models were trained using three predefined subsets of the training dataset as follows: (I) the baseline dataset (BL) consisting of pre-, intra-, and postoperative data, including the TNM but excluding tumor biomarkers, (II) clinical data accessible at the time of the initial diagnostic workup (primary staging dataset, PS), and (III) the PS dataset including tumor biomarkers from tissue microarrays (PS + biomarkers), excluding TNM status. We used permutation feature importance for feature selection to identify only important variables for AI-driven reduced datasets and subsequent model retraining. RESULTS: Model training on the BL dataset demonstrated similar predictive performances for all models (Accuracy, ACC: 0.73/0.74/0.76/0.75/0.73; AUC: 0.78/0.82/0.83/0.80/


Introduction
Personalized medicine, in the era of digital patient data, has entered a new phase with the emergence of Artificial Intelligence (AI). However, AI-guided medical treatment has not yet developed its full potential. This is attributable to external factors, including legal concerns related to data protection, as well as to internal factors, notably the gap between data scientists' methodologies and the domain knowledge of medical staff. The process of selecting suitable AI methods and evaluating their applicability to medical inquiries has not yet been standardized and remains a learning curve for the medical community [1].
It is crucial to understand that AI confers significant advantages to the medical field by effectively handling extensive datasets to recognize patterns [2]. AI can be categorized into Machine Learning (ML) and Deep Learning (DL), both of which have been previously used in studies concerning patients with upper gastrointestinal cancer [3][4][5]. ML uses specific algorithms trained on a data sample to construct predictive models, such as ensemble methods based on decision trees like Random Forests or Gradient Boosting. DL, a subset of ML, is more complex and necessitates greater computational power. It is modeled after the human brain structure and excels in processing various data types, such as images, language, and tabular data [2,6].
A pivotal attribute of AI is rapid and individual data analysis, a quality of increasing importance in medical and oncological treatment [7]. The economic structures of the healthcare system demand time-efficient therapeutic approaches. Moreover, well-informed patients seek prompt answers, particularly when confronted with a life-threatening disease such as upper gastrointestinal cancer.
Esophageal cancer (EC) is the eighth most common cancer globally. Surgical therapy remains the primary curative approach for locally advanced EC [8]. Furthermore, overall survival (OS) benefits from additional neoadjuvant treatments, such as radiochemotherapy [9,10]. Currently, survival probabilities are primarily determined based on the pathological Tumor, Node, Metastasis (TNM) stage groups [11], and the long-term prognosis remains poor with a 5-year survival rate of approximately 20% [12,13].
The treatment choice for the patient is determined during the interdisciplinary tumor board conference after primary staging [14]. Factors including tumor histology from the primary biopsy, radiologically observed nodal and organ metastases, and patient comorbidities are pivotal in this decision-making process [15]. Nevertheless, this approach seems overly simplistic considering the wealth of additional, collectable data, such as various patient characteristics and tumor biomarkers. In particular, tumor biomarkers are gaining increasing significance in the treatment of EC, for example, the assessment of the programmed death-ligand 1 (PD-L1) status for targeted therapy [16].
In this context, AI could function as a valuable tool to investigate the relationship between the patient's medical history and the histopathological specifics of the tumor disease, facilitating personalized therapy. Our institution, as a high-volume center for EC surgical treatment, offers the opportunity for AI-driven analysis of extensive patient cohorts with a large number of specified biomarkers. This study's objective is to predict 5-year survival status by comparing various AI algorithms trained on different data subsets. Initially, pre-, intra-, and postoperative clinical information, including pathological TNM, is used to train the respective models. Subsequently, the models are trained on the preoperative information obtained during the initial diagnosis (primary staging), a period when the pathological TNM status is not yet available. Then, biomarker analysis is incorporated into the data from the primary staging to assess its impact on the predictive power. Lastly, AI-driven feature selection is conducted to identify important variables for predictions.

Inclusion Criteria and Patient Characteristics
For this retrospective study, a total of 1002 patients with EC (adenocarcinoma AC 84.03%; squamous cell carcinoma SCC 14.7%; other carcinomas 1.2%) who underwent primary surgical treatment or surgery after neoadjuvant therapy between 1996 and 2021 at the Department of General, Visceral and Cancer Surgery, University of Cologne, Germany, were included (Table 1). The standard surgical approach involved laparotomic, laparoscopic, or robotic gastrolysis followed by right transthoracic en-bloc esophagectomy and two-field lymphadenectomy of mediastinal and abdominal lymph nodes (Ivor Lewis Esophagectomy). The inclusion criteria encompassed patients with a post-surgery OS of at least 90 days to exclude mortality due to postoperative complications. We decided to include features with a maximum missing rate of 87% (missing values per feature: mean 0.53, SD 0.22) to avoid prematurely excluding clinical attributes that could be crucial for AI-based predictions. Preliminary studies with a data completeness threshold of 75% per feature, resulting in fewer features but a lower missing rate, yielded inferior outcomes compared with the approach with more features and a higher missing rate.
Written consent to data collection in a clinical and pathological database was obtained from all patients prior to treatment. As this is a retrospective study, all data, including biomarkers, were already fully collected at the beginning of this study. This study was performed in accordance with the Declaration of Helsinki and approved by the Institutional Review Board (or Ethics Committee) of the University of Cologne (16-230, 9 September 2016).

Models
In this study, supervised learning techniques for binary classification included two Machine Learning (ML) methods, Random Forest and XG-Boost, as well as two Deep Learning (DL) algorithms, Artificial Neural Networks and TabNet. Logistic Regression (LR) served as the classical and state-of-the-art statistical method. The scikit-learn package was utilized for constructing the models unless otherwise specified [38].
Random Forests (RFs) belong to the ensemble ML algorithms, relying on multiple decision trees to classify the target. A key concept in RFs is bootstrap aggregation, commonly known as bagging. This involves training each decision tree on a bootstrap subsample of the training data with a distinct set of features to enhance model performance [39]. Besides easy implementation, RFs are robust against overfitting even in the presence of high-dimensional data, as seen in our dataset [40].
Extreme Gradient Boosting (XG-Boost, XG) is an additional decision tree-based ensemble method used for supervised learning of tabular data. Rather than relying on bagging alone, XG-Boost builds its decision trees sequentially, with each new tree fitted to the errors of the preceding ones, thereby enhancing the predictions of subsequent learners. Additionally, XG-Boost demonstrates the capability to handle missing values [41], proving beneficial for our dataset with a notable proportion of missing values.
Artificial Neural Networks (ANNs) are constructed as a feed-forward network of nodes arranged in input, hidden, and output layers; the network learns from the dataset by adjusting its weights during training, a process known as backpropagation [42]. Neural networks are not as frequently used for tabular data as ML methods. Nevertheless, ANNs were selected for this study given their ability to handle complex patterns [42] such as those present in medical datasets. In this study, the fast.ai library was utilized to create a feed-forward ANN for classifying the 5-year OS [43]. The ANN architecture included a maximum of two layers and up to 140 nodes, depending on hyperparameter search.
TabNet (TN) is the latest method used in this study to include another deep architecture model alongside ANNs. It was first introduced in 2019 by a research team of Google Cloud with the objective of bridging the gap between DL techniques and tabular datasets, which had predominantly been utilized for training ML models. TN's architecture processes, transforms, and selects the features in sequential, nonlinear decision steps (Feature/Attentive Transformer) for final classification [44]. In this study, a PyTorch implementation of TN was used [45].
Logistic Regression (LR) is a well-known statistical approach using the logistic function for dichotomous classification [46]. LR without regularization was selected as the state-of-the-art model to benchmark the performance of the ML and DL approaches.

Labeling, Data Splitting, and Data Preprocessing
The original dataset was labeled into two groups based on the 5-year OS after surgical treatment. Short-term survival was defined as an OS greater than 90 days but less than five years with recorded death (Label 0). Long-term survival was designated when the OS was equal to or greater than five years (Label 1). This yielded a fairly balanced dataset (Label 0: 596 patients, 59.5%; Label 1: 406 patients, 40.5%).
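The labeling rule can be sketched as a small function. This is a minimal illustration, not the authors' code: the argument names `os_days` and `death_recorded`, the conversion of five years to 5 × 365.25 days, and the handling of patients who match neither definition (returned as `None`) are assumptions.

```python
from typing import Optional

DAYS_PER_YEAR = 365.25            # assumption: the text does not state the conversion
MIN_OS_DAYS = 90                  # inclusion threshold named in the text
FIVE_YEARS_DAYS = 5 * DAYS_PER_YEAR


def label_five_year_survival(os_days: float, death_recorded: bool) -> Optional[int]:
    """Return 1 (long-term), 0 (short-term), or None if neither definition applies."""
    if os_days >= FIVE_YEARS_DAYS:
        return 1                  # OS of five years or more
    if os_days > MIN_OS_DAYS and death_recorded:
        return 0                  # died after day 90 but before five years
    return None                   # e.g. follow-up shorter than five years without recorded death
```

Returning `None` for unlabeled cases keeps the exclusion step explicit; how such patients were actually handled is not described in the text.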
An independent hold-out set was created and consistently utilized as a test set for all subsequent models (n = 100). The final training set comprised 902 patients. Validation sets were derived from the training dataset using stratified sampling, with the stratification being based on the two cohorts (n = 91). This was performed particularly for architectures like ANN, TN, and XG and for AI-driven feature selection.
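A split with these sizes can be reproduced with scikit-learn's `train_test_split`. The sketch below uses only the class counts reported above; the random seeds, the variable names, and the stratification of the hold-out set are assumptions (the text states stratification only for the validation sets).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts reported in the text: 596 short-term (0) and 406 long-term (1) survivors.
y = np.array([0] * 596 + [1] * 406)
patient_ids = np.arange(len(y))  # stand-in for the real feature table

# Independent hold-out test set of 100 patients (stratifying here is an assumption).
train_ids, test_ids, y_train, y_test = train_test_split(
    patient_ids, y, test_size=100, stratify=y, random_state=0
)

# Stratified validation set of 91 patients drawn from the 902 training patients.
fit_ids, val_ids, y_fit, y_val = train_test_split(
    train_ids, y_train, test_size=91, stratify=y_train, random_state=0
)
```

Stratifying on the label keeps the roughly 60/40 class ratio in each partition, which matters for a validation set of only 91 patients.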
The features comprised 14 continuous variables and 96 categorical variables. Continuous data were first normalized, and then missing data were imputed with scikit-learn's k-Nearest Neighbor imputer (n_neighbors = 10) [38], except in the XG model, which can handle missing continuous data [41], and in the ANN model, where the median was imputed utilizing the FillMissing method [43]. For the ML methods (RF, XG) and LR, categorical data were transformed into dummy variables through one-hot encoding, resulting in a total of 301 features. One-hot encoding converts a categorical variable into one new binary variable per category. DL architectures (ANN, TN) used embeddings for categorical variables. Missing data points in the categorical features were treated as their own category.
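The preprocessing for the ML models and LR might look roughly like the following sketch. It assumes z-score scaling for the "normalization" (the text does not name the method), uses invented column names and synthetic data, and fits everything on a single table for brevity; in practice the scaler and imputer would be fitted on the training data only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler


def preprocess(df, continuous_cols, categorical_cols):
    """Scale and KNN-impute continuous columns; one-hot encode categorical columns."""
    # StandardScaler ignores NaN when fitting and keeps NaN in its output.
    scaled = StandardScaler().fit_transform(df[continuous_cols])
    imputed = KNNImputer(n_neighbors=10).fit_transform(scaled)
    continuous = pd.DataFrame(imputed, columns=continuous_cols, index=df.index)
    # dummy_na=True adds an indicator column so missing values form their own category.
    dummies = pd.get_dummies(df[categorical_cols], columns=categorical_cols, dummy_na=True)
    return pd.concat([continuous, dummies], axis=1)


# Synthetic demonstration data (hypothetical columns, not the study variables).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.normal(63, 9, size=30),
    "bmi": rng.normal(26, 4, size=30),
    "histology": rng.choice(["AC", "SCC"], size=30),
})
demo.loc[[2, 7], "bmi"] = np.nan
demo.loc[[4, 11], "histology"] = np.nan

features = preprocess(demo, ["age", "bmi"], ["histology"])
```

Scaling before imputation, as described in the text, keeps the nearest-neighbor distances from being dominated by variables with large numeric ranges.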

Hyperparameter Search and Feature Selection via Permutation Feature Importance
Hyperparameter optimization was performed using scikit-learn's Randomized and Grid Search Cross Validation (CV) with a stratified 10-fold approach for RF and XG [38]. Optimum hyperparameter values for ANN and TN were determined using Optuna, an open-source optimization framework based on pruning and sampling with a custom-defined number of trials (n = 100) [47]. Optimization was conducted for each computational experiment. An outline of optimized hyperparameters is provided in Supplementary Table S2.
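A randomized search with stratified 10-fold cross-validation can be set up as in the sketch below. The search space, the number of sampled configurations, and the synthetic data are illustrative assumptions; the hyperparameters actually optimized are those listed in Supplementary Table S2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Hypothetical search space, chosen only for illustration.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                      # number of sampled configurations (assumption)
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_  # mean accuracy over the 10 folds
```

`GridSearchCV` accepts the same `cv` argument and could be substituted when the search space is small enough to enumerate.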
Permutation feature importance (PFI) was utilized to rank the importance of features for model performance. This method involves randomly shuffling the values of individual features (n = 100) and evaluating the resulting reduction in model performance [48]. PFI was applied to the validation set in this study to identify the important features. PFI was conducted using the scikit-learn library for RF, TN, and XG, while a modified code was used for fast.ai's ANN [38,49].
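The following sketch shows PFI on a validation set with scikit-learn's `permutation_importance`. It reads "n = 100" as the number of shuffles per feature, which is an assumption, as are the synthetic data and the selection rule (keep features whose mean importance is positive); the text does not state the threshold used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only features 0 and 1 carry signal, features 2 to 4 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle each feature 100 times on the validation set and record the accuracy drop.
result = permutation_importance(
    model, X_val, y_val, scoring="accuracy", n_repeats=100, random_state=0
)

ranking = np.argsort(result.importances_mean)[::-1]  # most important feature first
selected = [int(i) for i in ranking if result.importances_mean[i] > 0]
```

The `selected` indices would then define a reduced dataset on which the model is retrained.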

Study Design
The study is structured as follows (Figure 1):
1. We created three predefined data subsets from the training set for model training as follows:
(a) Baseline dataset (BL): All clinical data, including information collected pre-, intra-, and postoperatively, as well as the pathological TNM status (n features = 55).
(b) Two preoperative data subsets for model training to assess predictive performance as follows:
- Primary staging dataset (PS dataset, n features = 29): This included only variables collected during primary staging until the time of the tumor board conference. It did not involve histopathological assessment.
- PS dataset plus tumor biomarkers (PS dataset + biomarkers, n features = 84). As there was no histopathological assessment available from the initial tumor biopsy, biomarkers from the tumor sample after surgical treatment were used.
2. We performed feature selection via PFI on both the BL dataset and the PS dataset containing biomarkers. The important features identified were used to create reduced datasets on which the models were retrained for survival predictions.

Figure 1. The baseline dataset (BL) contains all available clinical data, including pathological TNM but excluding immunohistochemical biomarker analysis. In the primary staging dataset (PS), information collected after the initial diagnosis was omitted. Another PS dataset was created, this time incorporating tumor biomarkers. Notably, the pathological TNM is not included in the PS datasets. This study proceeded in the following two steps: 1. Models were trained on the respective data subsets (n = 902) to predict 5-year survival status. 2. Feature selection via PFI was performed on both the BL dataset and the PS dataset containing biomarkers. The important features identified were utilized to create reduced datasets on which models were retrained for survival predictions. Predictions were always made on the independent test set (n = 100). XG = Extreme Gradient Boosting, RF = Random Forest, ANN = Artificial Neural Network, TN = TabNet, n feat = number of features in datasets.

Statistical Analysis
Data analysis was performed with Python (version 3.8.8) using the pandas (version 1.4.3) [50], NumPy (version 1.21.5) [51], matplotlib (version 3.5.1) [52], and scikit-learn (version 1.0.2) [38] packages. Model performance was evaluated in a two-fold manner. First, 10-fold cross-validation was performed on the whole training set to obtain a measure of how well the models generalize. The cross-validation score (CV-score) represents the mean of the 10-fold accuracies and is presented along with its standard deviation (SD) in this study. Second, the trained models were tested on the independent test set. For this purpose, accuracy (ACC) with its 95% confidence intervals (95% CI) and receiver operating characteristic curves (ROC) with their corresponding area under the curve (AUC) were calculated.
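The reported metrics can be illustrated with a few lines of standard-library Python. This is a sketch under stated assumptions: the confidence interval uses the normal approximation and the SD is the sample standard deviation, neither of which is specified in the text, and the function names are invented for this example.

```python
import math
from statistics import mean, stdev


def cv_score(fold_accuracies):
    """Mean and sample SD of the per-fold accuracies (e.g. from 10-fold CV)."""
    return mean(fold_accuracies), stdev(fold_accuracies)


def accuracy_with_ci(y_true, y_pred, z=1.96):
    """Accuracy with a normal-approximation 95% confidence interval."""
    n = len(y_true)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)


def roc_auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative one."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))
```

With a test set of 100 patients, an accuracy of 0.75 gives an interval of roughly 0.67 to 0.83 under this approximation, which is worth keeping in mind when comparing models whose accuracies differ by a few points.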

AI Models Effectively Predict 5-Year Survival Using Clinical Data and Pathological TNM
The initial model training was carried out on the baseline dataset (BL) to predict long-term survival exceeding five years (Figure 2, Table 2). This predefined data subset included all available clinical data and the pathological TNM status. The clinical data consisted of preoperative data such as the medical history and staging, intraoperative data, and postoperative data such as complications and the neoadjuvant treatment, if applicable.
Table 2. Predictions on the test set (ACC, AUC) and cross-validation accuracy on the distinct training subsets (CV-score), namely, the baseline dataset (BL), the primary staging dataset (PS), and the PS dataset including tumor biomarkers (PS + biomarkers). The BL dataset contains all available clinical data including the pathological TNM status but not the tumor biomarkers. The PS dataset contains clinical data available at the primary diagnosis workup (= primary staging). RF = Random Forest, XG = Extreme Gradient Boosting, ANN = Artificial Neural Network, TN = TabNet, LR = Logistic Regression.

TN demonstrated similar predictive performance on the test set (ACC: 0.75, AUC: 0.80) but exhibited reduced generalizability compared with the other AI models (CV-score [SD]: 0.66 [0.03]). LR showed results similar to the better-performing AI models (ACC: 0.73; AUC: 0.79; CV-score [SD]: 0.73 [0.05]).

Including Biomarkers into Early Clinical Data Demonstrates Similar Model Performance Compared to Comprehensive Clinical Data including the TNM Status
We proceeded to assess whether the known clinical and diagnostic features until the primary staging work-up, with or without the inclusion of tumor biomarkers, are sufficient for classifying long-term versus short-term OS (Figure 2, Table 2). The predictive performance of all models decreased notably when trained on the PS dataset excluding the histopathological and clinical parameters after primary staging. Particularly noteworthy is the drop in generalizability for all models except ANN, with CV scores ranging between 0.65 and 0.67 for RF, XG, and TN.
Surprisingly, when the tumor biomarkers were integrated into the PS dataset for model training, predictions improved. Accuracies rose from 0.7 to 0.77 for RF, from 0.73 to 0.79 for XG, from 0.71 to 0.75 for ANN, and from 0.69 to 0.72 for TN (see Table 2: PS dataset vs. PS dataset + biomarkers).
ANN not only displayed enhanced predictive performance on the test set but also exhibited good generalizability when trained with the PS dataset with biomarkers (ACC: 0.75, AUC: 0.86, CV-score ± SD: 0.76 ± 0.03). XG and RF exhibited enhanced predictive performance on the test set, achieving accuracies of 0.79 and 0.77, respectively, when biomarkers were included in the PS dataset. However, although their generalizability improved, it did not reach the level observed when trained on the BL dataset, as indicated by their CV scores of 0.69 and 0.68, respectively. LR exhibited the least accurate predictions when trained on the PS dataset (AUC: 0.7), and unlike the AI models, it did not demonstrate improvement when tumor markers were incorporated into the PS dataset (AUC: 0.69).
It is noteworthy that the predictive accuracy of the AI models subsequent to the incorporation of the biomarkers into the PS dataset became similar again to the model performance after training on the BL dataset, which included the pathological TNM status (see Table 2: PS dataset + biomarkers vs. BL dataset).

Models Trained on AI-Driven Data Subsets with Important Features Achieve Constant Predictive Performance
We conducted feature selection using PFI on both the BL dataset and the PS dataset containing biomarkers. The important features identified differ for each model after training on the respective predefined data subsets. The derived important features for both data subsets encompass a combination of clinical and histopathological data and are presented in detail in Figures 3 and 4. A detailed description of all included features can be found in Supplementary Table S1.
With AI-driven feature selection, we identified important features for each AI model (RF/XG/ANN/TN) to create new subsets from the BL dataset and the PS dataset with biomarkers. The original BL dataset, initially consisting of 55 features, was reduced to 23/26/27/28 features, and the PS dataset with biomarkers was reduced from 84 to 38/37/38/41 features for RF/XG/ANN/TN, respectively. The model performances after training on the respective AI-driven data subsets did not decline (Table 3, Figure 5) in comparison to using all available features from the original data subsets. The accuracies of the AI models ranged between 0.73 and 0.76 when trained on the entire BL dataset and between 0.7 and 0.76 when trained on the respective AI-driven data subsets. Similarly, accuracies of the AI models ranged between 0.72 and 0.79 when trained on the entire PS dataset with biomarkers, and between 0.74 and 0.78 after training with the respective AI-driven data subset.
Table 3. Predictions on the independent test set (ACC, AUC) and cross-validation accuracy on the respective training sets (CV-score) using the AI-driven reduced datasets after feature selection. Feature selection via permutation feature importance was performed on the baseline dataset (BL) and the primary staging dataset including tumor biomarkers (PS + biomarkers). RF = Random Forest, XG = Extreme Gradient Boosting, ANN = Artificial Neural Network, TN = TabNet.
When comparing the AI-driven important features after PFI, five features were consistently identified in all models trained on the BL dataset: histopathological lymph node status (pN), histopathological tumor size (pT), clinical tumor size (cT), age at the time of surgery, and postoperative tracheostomy. Feature selection on the PS dataset with biomarkers yielded eight shared features in all models including the following: age at the time of surgery, TP-53 gene mutation, Mesothelin expression, thymidine phosphorylase (TYMP) expression, NANOG homeobox protein expression, and indoleamine 2,3-dioxygenase (IDO) expressed on tumor-infiltrating lymphocytes, as well as tumor-infiltrating mast cells and natural killer cells (NK cells).
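Finding the features shared by all models amounts to a set intersection over the per-model selections. In the sketch below, the five shared BL features are taken from the text, while the additional `placeholder_*` entries and the identifier spellings are invented purely to make the example non-trivial; they are not results of this study.

```python
# Features reported in the text as shared by all models on the BL dataset.
shared_bl = {"pN", "pT", "cT", "age_at_surgery", "postoperative_tracheostomy"}

# Hypothetical per-model selections: the shared features plus invented placeholders.
selected_by_model = {
    "RF": shared_bl | {"placeholder_a", "placeholder_b"},
    "XG": shared_bl | {"placeholder_b", "placeholder_c"},
    "ANN": shared_bl | {"placeholder_a", "placeholder_d"},
    "TN": shared_bl | {"placeholder_c", "placeholder_d"},
}

# Features that every model selected.
common_features = set.intersection(*selected_by_model.values())
```

Features that appear in only some selections (here the placeholders) drop out, which is why the shared list is much shorter than any single model's reduced dataset.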

Discussion
This study analyzed the potential of AI techniques in predicting the long-term survival of EC patients. Moreover, we aimed to elucidate the relevance of biomarkers derived from tissue microarray analysis of post-surgical tumor specimens in predicting survival outcomes. We hypothesize that these biomarkers have the same predictive power as the pathological TNM status.
In our study, we demonstrate that the 5-year survival status can be predicted at a satisfactory and comparable level with an accuracy exceeding 0.73 and an AUC exceeding 0.78 using AI models such as RF, XG, ANN, and TN. The cross-validation accuracies (CV-score) of the distinct models closely aligned with the accuracies on the test set, indicating robust generalizability of the models.
These results are in accordance with previous studies by two Asian research groups. Gong et al. [3] achieved an AUC of 0.85, 0.84, and 0.83 for 5-year survival prediction using XG, ANN, and RF, respectively, with reported cross-validation accuracies higher than our study, ranging between 0.86 and 0.87. This disparity could be attributed to their notably larger dataset including more than 10,000 patients. However, our study encompassed 113 features with over half of them being biomarkers. In contrast, the referenced study group incorporated only 21 features from a database predominantly centered on some clinical data and basic histopathological information, lacking biomarkers.
Similarly, Sato et al. [5] reported an AUC of 0.88 for an ANN in predicting the 5-year survival of EC patients. This study group focused on neural networks with different architectures and did not explore other AI methods. To our knowledge, TN has never been utilized for predicting the survival status of EC. However, TN showed the least predictive capability among all AI methods tested in this study.
Although we utilized a dataset with a high missing rate, we still observed constant and satisfactory predictive ability of the AI models. Not only predictions on the test set but also constant CV accuracies reflect the models' ability to generalize even when certain information is missing. Notably, XG Boost is known to handle missing continuous values effectively. In contrast, the other models required imputation of the continuous variables, posing the risk of introducing biases such as skewing the data towards outliers or not reflecting the true values [53]. This concern warrants caution, particularly if new diagnostic approaches are to be based on biased findings.
AI has not yet become an integral part of routine medical treatment and decision-making. Nevertheless, in recent years, numerous studies have aimed to demonstrate the advantages of these techniques, particularly in providing personalized predictions for individual patients [54,55]. Still, medical guidelines rely on studies based on statistical tests. Classic statistical tests help to understand the relationship between a data sample and a population but are less effective in making personalized predictions [56]. In this study, LR, chosen as the classical statistical approach for comparison, demonstrated inferior performance when trained on data representing the early stage of oncological treatment. Other previous studies have also shown that statistical tests, such as linear discriminant analysis for survival status [5] or traditional Cox regression models for survival prediction [4,57] in patients with EC, were outperformed by ML and DL methods.
However, AI methods also pose potential sources of bias, with overfitting being a notable concern. Overfitting occurs when models learn the training data too well, lacking the ability to generalize effectively to new data. A high-dimensional dataset, like ours, may increase the risk of overfitting. To address this bias, we utilized specific techniques. Firstly, we selected models such as RF or XG, which are less susceptible to overfitting [40,41], or deeper models such as ANN or TN that utilize an additional validation set. Secondly, we used hyperparameter optimization through Randomized or Grid Search Cross Validation for the ML models [38] and an automated hyperparameter tuning tool (Optuna) [47] for the DL models. Additionally, we assessed cross-validation accuracy on the training set to evaluate the models' generalization capability and compared it to their performance on the test set, as discussed later.
The robust predictive efficacy of all models, observed when trained with pre-, intra-, and postoperative data, can be attributed to the inclusion of the pathological TNM classification. In all AI models, both the pT and pN features were identified as crucial for predictions after feature selection. However, the pathological TNM status alone is insufficient for accurately classifying long-term survival status with AI, as demonstrated in the study by Sato et al. [5]. The authors found significantly poorer predictions when using ANNs trained only with the pathological TNM status compared with networks that incorporated additional data, such as pre- or postoperative clinical information.
The tumor board's recommendation following the primary staging work-up plays a pivotal role in determining the subsequent treatment for EC [14]. Estimating survival probabilities at this early stage of oncological therapy without relying on TNM staging would be of significant interest. Thus, we asked if data available up to the time of the tumor board (results of the initial CT scan or endoscopy, tumor size in endoscopy next to the medical history, and patient baseline characteristics) is sufficient for predicting long-term survival. As previously mentioned, the predefined primary staging datasets did not include pathological TNM.
All models trained exclusively with known clinical features at the time of primary staging showed a decline in performance. However, when we incorporated tumor biomarkers into the primary staging dataset, predictive performance improved. RF, XG, and ANN demonstrated similarly robust predictions on the test set. Among them, ANN additionally exhibited good generalizability, as indicated by consistent cross-validation accuracy, in contrast to the other models, which exhibited inferior CV accuracies, suggesting overfitting. Model training on the baseline dataset, including pathological TNM information but not tumor biomarkers, and training on the primary staging dataset with the addition of tumor biomarkers produced similar results. This suggests that biomarkers have the potential to replace TNM for survival predictions.
The results of this study indicate that postoperative information about the tumor tissue is crucial for predicting survival status, whether in terms of TNM or tumor biomarkers. It is important to note that the tumor biomarkers used in this study were not assessed from the initial histology through endoscopic biopsy but from the tumor specimen after surgery, often following neoadjuvant therapy. When a tumor responds to neoadjuvant therapy, the tumor tissue undergoes changes that may alter the expression of biomarkers, which may therefore differ from the expression observed in the primary biopsy.
Nevertheless, our experiments demonstrate that biomarkers alongside early clinical data hold predictive value comparable to that of the well-established TNM status combined with pre-, intra-, and postoperative clinical data. To predict survival at the time of the primary diagnosis, we propose analyzing the important tumor biomarkers identified in this study in future primary biopsies. We anticipate that biomarker analysis at this early time point will offer predictive value similar to that obtained from the final tumor specimen. Nevertheless, this hypothesis warrants confirmation.
To explore the most influential variables for survival prediction, we used permutation feature importance on the validation sets.The features identified by each model were then used to create AI-driven feature subsets, and the model's performance was assessed on the independent test set.Interestingly, the predictive performance of the models remained consistent with the AI-driven data subsets, suggesting that the predictive performance of AI models is not highly dependent on the quantity of features.Other investigators reported similar findings with either comparable [4] or even improved performance [5] using AIdriven reduced datasets.However, the specific features that are crucial for predictions may not be evident from the outset of model training.Therefore, a two-step process, initially including all available data and then identifying important features, is recommended.
The feature selection method utilized in this study was PFI. It is important to acknowledge the pitfalls associated with this method. PFI operates under the assumption that individual features are independent and uncorrelated [58]. In the context of a medical dataset, this assumption does not reflect reality and may result in the omission of information that is actually important. Hence, various feature selection methods need to be compared before integrating them into a diagnostic workflow.
The comparison of important features across the models revealed five shared features in the BL dataset (pN, pT, cT, patient age at surgery, postoperative tracheostomy) and eight common features in the PS dataset including tumor biomarkers (patient age, TP-53 mutation, Mesothelin expression, TYMP expression, NANOG expression, IDO expressed on tumor-infiltrating lymphocytes, tumor-infiltrating mast and NK cells).
Lymph node involvement (pN) indicates an advanced tumor stage and has been documented as a predictive factor for the survival of patients with EC [59][60][61][62], a finding consistent with our study. Besides nodal and distant metastasis, tumor infiltration (pT) determines tumor stage, which reflects survival probabilities [11]. The clinical T (cT) status plays a significant role in determining therapy strategies, yet survival predictions based on cT remain uncertain [11].
Furthermore, age has been recognized as an important variable by other researchers who utilized ML models for survival predictions in EC patients [3][4][5]. Postoperative tracheostomy, indicative of major postoperative complications and stays in the intensive care unit, aligns with findings by Jung et al. [4], who identified those two features as important AI-driven predictors for survival in patients with upper gastrointestinal cancer. Our findings suggest that tracheostomy is an early determinant for late outcomes and that surgical complications may affect OS.
In this study, we placed particular emphasis on exploring the predictive significance of an extensive set of specified biomarkers in conjunction with other patient data for long-term survival. This aspect distinguishes our study from others that have also utilized AI methods to investigate the survival of patients with EC.
Previous in vitro studies primarily examined individual biomarkers concerning OS in patients with EC. However, the strength of AI techniques lies in their capacity to analyze all available biomarkers in combination with clinical data, enabling them to identify complex patterns and relationships that may not be apparent through traditional methods.
The expression of IDO on tumor-infiltrating lymphocytes was found to have a positive impact on OS in patients with esophageal AC [22]. High expression of NANOG, a transcription factor physiologically associated with pluripotency [63], and of TYMP, a promoter of tumor angiogenesis [64], in SCC is related to poor OS. The mentioned biomarkers were detected following AI-driven feature selection in all models. These findings emphasize the potential clinical relevance of these biomarkers in the context of predicting survival outcomes.
While previous studies found no correlation of either TP53 mutation [31] or mesothelin expression in SCC or ACC [26] with OS, this study revealed that TP53 mutation and mesothelin expression were important features in predicting 5-year survival status in all models. This may suggest that these biomarkers, while not individually predictive, become relevant when considered in combination with other markers, as presented in the datasets of this study.
A positive correlation between tumor-infiltrating NK cell density and OS [65], as well as an inverse correlation between mast cell density and OS [66,67], has been reported in esophageal SCC. In this study, all models trained on the PS dataset including the tumor biomarkers identified mast and NK cells in the tumor microenvironment as important variables for classifying 5-year survival. However, this study does not provide a deeper understanding of the interaction and activation status of these immune cells in EC.
These biomarkers, derived from the final tumor sample and evaluated in reduced datasets, could potentially serve as a basis for future examination in primary biopsies. The assessment of multiple tumor markers on primary biopsies is constrained compared to the final tumor specimen, which provides a larger tissue volume for analysis. Nevertheless, our study provides the opportunity to concentrate on the identified important biomarkers and investigate them in a primary biopsy. Additionally, consideration of the entire tumor sample, including the surrounding healthy tissue and the transition zone, enables a comprehensive understanding of the tumor environment. Therefore, biomarkers related to the surrounding tumor microenvironment, like IDO on tumor-infiltrating lymphocytes [22], should be considered in primary biopsies, involving targeted sampling in healthy, surrounding tissue.
In summary, the findings of this study indicate that early survival prediction in cancer treatment is feasible when additional histopathological information about the tumor is taken into account. The strength of this study lies in the integration of clinical patient data with biomarkers for analysis with AI methods. The biomarkers utilized in this study, in conjunction with early clinical data, exhibited similar predictive capability for long-term survival when compared with comprehensive data from various time points of oncological treatment combined with the pathological TNM status. This offers an opportunity to further explore their predictive value, which may become a valuable tool for personalized medicine in the future.

Limitations
This study intentionally incorporated clinically relevant features with a substantial percentage of missing values. In the field of data science, there are no established guidelines concerning an acceptable threshold for missing values in a dataset, which thus remains a field of empirical testing. Preliminary studies revealed that models trained with fewer missing values, and consequently fewer input features, declined in performance. To address the issue of missing data and prevent bias, missing values in categorical variables were handled by considering them as a distinct category during the training process.
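Treating missing values as a distinct category can be illustrated with a short pandas sketch. The column name, its values, and the label "missing" are hypothetical and chosen only for illustration; the study does not specify how the category was encoded.

```python
import pandas as pd

# Hypothetical categorical feature with missing entries (not study data).
df = pd.DataFrame({"grading": ["G1", None, "G3", "G2", None]})

# Replace missing entries with an explicit label so that "missing"
# becomes one more category the model can learn from.
df["grading"] = df["grading"].fillna("missing").astype("category")

print(df["grading"].value_counts())
```

With this encoding no rows or columns have to be dropped, which is the reason the text gives for retaining features with many missing values.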
Permutation feature importance (PFI) was utilized as a tool for feature selection in this study. It is important to note that PFI may be less suitable for models trained on correlated features, as it can introduce a bias by distributing importance among correlated features.
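The dilution of importance among correlated features can be demonstrated with a small NumPy sketch. Everything in it is an assumption made for illustration: the data are synthetic, the model is a ridge-regularized linear regression rather than any model used in the study, PFI is computed by hand as the increase in mean squared error after shuffling a column, and the correlated feature is an exact copy, which is the extreme case of correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)              # the only truly informative feature
noise_feature = rng.normal(size=n)   # irrelevant feature
y = x1 + 0.1 * rng.normal(size=n)

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression weights (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def pfi_mse(X, y, w, column, rng, n_repeats=20):
    """Mean increase in MSE when one column is randomly permuted."""
    base = np.mean((X @ w - y) ** 2)
    increases = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, column] = rng.permutation(X_perm[:, column])
        increases.append(np.mean((X_perm @ w - y) ** 2) - base)
    return float(np.mean(increases))

# Dataset A: x1 appears once. Dataset B: x1 appears twice (exact duplicate).
X_single = np.column_stack([x1, noise_feature])
X_dup = np.column_stack([x1, x1, noise_feature])

w_single = fit_ridge(X_single, y)
w_dup = fit_ridge(X_dup, y)

imp_single = pfi_mse(X_single, y, w_single, column=0, rng=rng)
imp_dup = pfi_mse(X_dup, y, w_dup, column=0, rng=rng)

print(f"importance of x1 without duplicate: {imp_single:.2f}")
print(f"importance of x1 with duplicate:    {imp_dup:.2f}")
```

In the duplicated dataset the model splits its weight between the two copies, so shuffling either copy alone degrades the predictions far less, and the same underlying information receives a markedly lower importance score.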
In the case of ML models, one-hot encoding, which involves converting a categorical variable into its individual categories, offers a better understanding of the importance of each individual category of a feature. However, it does not provide insights into whether the presence (TRUE) or absence (FALSE) of a dummy variable is responsible for the prediction. Making statements about the importance of specific categories becomes even more challenging with DL methods, as these models preserve the structure of categorical variables and represent their categories using embeddings, making it difficult to directly assess category importance.
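To illustrate the one-hot encoding mentioned above, the following sketch expands a categorical column into dummy variables with pandas. The column name and category labels are hypothetical and serve only as an example.

```python
import pandas as pd

# Hypothetical categorical feature (not study data).
df = pd.DataFrame({"pN": ["N0", "N1", "N2", "N0"]})

# One dummy (TRUE/FALSE) column per category.
dummies = pd.get_dummies(df, columns=["pN"], dtype=bool)

print(dummies)
```

Each dummy column is a TRUE/FALSE indicator, so an importance score for a column such as `pN_N0` shows that the indicator matters but not which of its two values drives the prediction.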
The biomarkers investigated in this study were derived from post-surgical tumor specimens, often after neoadjuvant therapy, rather than from the initial endoscopic biopsy. Future research should prioritize the analysis of biomarkers from early tumor biopsies to investigate survival predictions based on tumor characteristics.
This study represents an initial effort to integrate clinical data with an extensive array of biomarkers for the purpose of AI-guided survival prediction. Some of the identified important biomarkers have been analyzed before by study groups of our clinic regarding their prognostic value in EC [22,26,31]. Our study did not delve deeply into providing a comprehensive explanation for the selection of specific biomarkers as crucial predictors of overall survival (OS) in patients with EC. This limitation is partly associated with the chosen method, namely PFI, for feature selection. PFI does not specify the aspects of the feature space that may be important for survival prognostication. Addressing this issue is a relevant objective for future investigations, with the aim of improving the model's comprehensibility.

1. We created three predefined data subsets from the training set for model training as follows:
(a) Baseline dataset (BL): All clinical data, including information collected pre-, intra-, and postoperatively, as well as the pathological TNM status (n features = 55).
(b) Two preoperative data subsets for model training to assess predictive performance as follows:
- Primary staging dataset (PS dataset, n features = 29): This included only variables collected during primary staging until the time of the tumor board conference. It did not involve histopathological assessment.
- PS dataset plus tumor biomarkers (PS dataset + biomarkers, n features = 84).
2. We performed feature selection via PFI based on the BL dataset or the PS dataset with biomarkers. The important variables identified were used to create reduced datasets for model retraining (BL: n features = 23/26/27/28; PS + biomarkers: n features = 38/37/38/41 for RF/XG/ANN/TN, respectively).
3. After model training on the distinct data subsets, predictions were always made on the independent test set.

Figure 1. Flow diagram of the study design. The baseline dataset (BL) contains all available clinical data, including pathological TNM but excluding immunohistochemical biomarker analysis. In the primary staging dataset (PS), information collected after the initial diagnosis was omitted. Another PS dataset was created, this time incorporating tumor biomarkers. Notably, the pathological TNM is not included in the PS datasets. This study proceeded in the following two steps: 1. Models were trained on the respective data subsets (n = 902) to predict 5-year survival status. 2. Feature selection via PFI was performed on both the BL dataset and the PS dataset containing biomarkers. The important features identified were utilized to create reduced datasets on which models were retrained for survival predictions. Predictions were always made on the independent test set (n = 100). XG = Extreme Gradient Boosting, RF = Random Forest, ANN = Artificial Neural Network, TN = TabNet, n feat = number of features in datasets.

Figure 2. ROC curves of AI models. The models were trained on the following three predefined data subsets: the baseline dataset (BL) that included all available clinical features with pathological TNM status (pTNM) but excluded tumor biomarkers, the primary staging dataset (PS) with information available during the initial diagnosis, and the PS dataset containing tumor biomarkers. Notably, the AI models exhibited a decline in predictive performance when trained exclusively with clinical information during the primary staging (PS dataset). However, the addition of tumor biomarkers to the PS dataset led to improved predictions, aligning with the performance observed when the AI models were trained on the BL dataset, which includes pTNM. LR was outperformed by the AI models when trained on the PS datasets. RF = Random Forest, XG = Extreme Gradient Boosting, ANN = Artificial Neural Network, TN = TabNet, LR = Logistic Regression.

Figure 3. Important features identified with permutation feature importance (PFI) after training AI models on the baseline dataset. The features identified for each model formed the reduced datasets used to classify overall survival. Features of ensemble models (RF, XG) are presented as dummy variables. Shared important features between all models are marked in pink. A detailed description of the other features displayed can be found in Supplementary Table S1. RF = Random Forest, XG = Extreme Gradient Boosting, ANN = Artificial Neural Network, TN = TabNet, pN = histopathological lymph node status, pT = histopathological tumor size, cT = clinical tumor size.


Figure 4. Important features identified with permutation feature importance (PFI) after training AI models on the primary staging dataset including biomarkers. These variables served as reduced datasets for model training to predict survival status. Features marked in pink represent shared important features between all models. A detailed description of the other features displayed can be found in Supplementary Table S1. RF = Random Forest, XG = Extreme Gradient Boosting, ANN = Artificial Neural Network, TN = TabNet, NK cells = Natural Killer cells, NANOG = NANOG homeobox protein expression, TYMP = thymidine phosphorylase expression, TP-53 = TP-53 gene expression, IDO = indoleamine 2,3-dioxygenase expressed on tumor-infiltrating lymphocytes.


Table 3. Predictions on the independent test set (ACC, AUC) and cross-validation accuracy on the respective training sets (CV-score) using the AI-driven reduced datasets after feature selection. Feature selection via permutation feature importance was performed on the baseline dataset (BL) and the primary staging dataset including tumor biomarkers (PS + biomarkers). RF = Random Forest, XG = Extreme Gradient Boosting, ANN = Artificial Neural Network, TN = TabNet.

Figure 5. ROC curves of AI models after training on AI-driven data subsets. Feature selection was performed on both the baseline dataset and the primary staging dataset, which included biomarkers. The important features identified were then used to create reduced subsets for retraining the models. Notably, the model performance after training with these AI-driven data subsets remained consistent compared to when trained on the respective original data subsets. RF = Random Forest, XG = Extreme Gradient Boosting, ANN = Artificial Neural Network, TN = TabNet.

Table 1. Basic outline of patients with esophageal cancer who underwent surgical therapy. ACC = Adenocarcinoma, SCC = Squamous cell carcinoma, pT = histopathological tumor size, pN = histopathological lymph node status, pL = invasion into lymphatic vessels, pV = invasion into vein.