A Cost-Effective Model for Predicting Recurrent Gastric Cancer Using Clinical Features

This study used artificial intelligence techniques to identify clinical biomarkers of recurrence in gastric cancer survivors. From a hospital-based cancer registry database in Taiwan, datasets containing the incidence of recurrence and clinical risk features of 2476 gastric cancer survivors were included. We benchmarked Random Forest against the MLP, C4.5, AdaBoost, and Bagging algorithms on several metrics and leveraged the synthetic minority oversampling technique (SMOTE) for the imbalanced dataset issue, cost-sensitive learning for risk assessment, and SHapley Additive exPlanations (SHAPs) for feature importance analysis. Our proposed Random Forest outperformed the other models, with an accuracy of 87.9%, a recall rate of 90.5%, a precision rate of 86%, and an F1 of 88.2% on the recurrence category under 10-fold cross-validation on a balanced dataset. We identified the top five clinical features of recurrent gastric cancer: stage, number of regional lymph nodes involved, Helicobacter pylori, BMI (body mass index), and gender; these features significantly affect the prediction model's output and are worth attention in subsequent causal effect analysis. Using an artificial intelligence model, the risk factors for recurrent gastric cancer could be identified and cost-effectively ranked according to their feature importance. These should also be crucial clinical features that give physicians the knowledge to screen high-risk patients among gastric cancer survivors.


Introduction
Gastric cancer is the fourth leading cause of cancer-related mortality worldwide, and its 5-year survival rate is less than 40%. H. pylori remains the leading cause of gastric cancer, whose histology can vary across lymphomas, sarcomas, gastrointestinal stromal tumors, and neuroendocrine tumors. With the development of biomarkers, more and more therapeutic strategies have been designed to treat gastric cancer. However, early detection of recurrent gastric cancer can fundamentally improve a patient's survival, so it is essential to continue screening and monitoring even after the patient is disease-free. Early detection allows physicians to provide early treatment for maximum benefit if tumors recur in the stomach or elsewhere. Meanwhile, doctors and patients have always attached importance to how to watch for cancer recurrence with caution. In Taiwan, early detection and diagnosis have become feasible due to the promotion of cancer screening in recent years.
Artificial intelligence (AI), with the improvement of computing capacity, has attracted attention in the medical field. Machine learning (ML), a part of AI, is critical in assisting diagnosis and prognosis. Zhang et al. [1] developed a multivariate logistic regression analysis on a nomogram with a radiomic signature and clinical risk factors to predict early gastric cancer recurrence. Liu et al. [2] used a Support Vector Machine classifier on a gene expression profiling dataset to predict gastric cancer recurrence and identify correlated feature genes. Zhou et al. [3] benchmarked the Random Forest (RF), GBM, Gradient Boosting, decision tree, and logistic regression algorithms. They concluded that the top four factors affecting postoperative recurrence of gastric cancer were body mass index (BMI), operation time, weight, and age.
We developed a risk prediction model for recurrence in gastric cancer survivors based on the above trends. In our study, we propose Random Forest [4] as the classifier and benchmark it against the Multilayer Perceptron (MLP) [5], C4.5 [6], AdaBoost [7], and Bootstrap Aggregation (Bagging) [8,9] algorithms on several metrics. Regarding data preprocessing, we leverage Synthetic Minority Oversampling Technique (SMOTE) [10] oversampling against the imbalanced dataset issue. For strategic risk assessment, we use cost-sensitive learning as a trade-off tool. Lastly, SHapley Additive exPlanations (SHAPs) [11-13] are used for feature importance analysis from both global and local perspectives.
Finally, RF with SMOTE performs well on this dataset, and SHAPs provide reasonable interpretation and feature importance. Furthermore, the top five risk factors for recurrent gastric cancer were identified as stage, number of regional lymph nodes involved, Helicobacter pylori, BMI, and gender.

Data Preparation and Machine Learning Models
A hospital cohort of 2476 gastric cancer survivors enrolled in the Taiwan Cancer Registry (TCR) database from July 2008 to August 2020 was studied. Of them, 432 recurrent gastric cancer cases were compared with 2044 non-recurrent survivors. All patients underwent curative surgery for gastric cancer, following standardized procedures for tumor resection and lymph node dissection; therefore, there was no subgroup of patients who underwent local excision surgery. Our database was analyzed in accordance with the Taiwan Cancer Registry Coding Manual Long Form Revision of 2018. This manual is published and revised by the Health Promotion Administration, Ministry of Health and Welfare. All hospitals in Taiwan register cancer data using this manual to provide information to the Taiwan Cancer Registry Center for integration. All clinical features recorded in this database were evaluated and used to establish our predictive model; these clinical features of gastric cancer survivors served as the predictive features. The proposed process flow diagram of this study is illustrated in Figure 1.
In step 1, we collected the recurrent gastric cancer dataset. For a better fit to our machine learning algorithms and TreeSHAP analysis, we encoded all data into categorical variables in step 2. Due to the data imbalance between the non-recurrence and recurrence categories, we balanced the dataset using the upsampling method SMOTE for the minority category in step 3. One of our essential purposes is to understand the prediction behaviors of our trained model; therefore, we used 10-fold cross-validation without a dataset split in step 4.
RF is the focus and is compared against the other baseline algorithms. In the final step, step 5, we further utilized model interpretation to observe feature importance and interactions.
RF is an ensemble machine learning model based on Classification and Regression Trees (CART), Bagging, and random feature selection. This randomness design helps prevent the overfitting seen in single decision trees and provides tolerance against noise and outliers. The tree elements make decisions based on information gain, which yields automatic feature selection. Namely, RF has a built-in feature selection function within the training phase, so there is no need to prefilter features via principal component analysis. We can also interpret that selection in the feature importance analysis section. RF feature importance is a permutation approach that measures the decrease in prediction performance as we permute the values of a feature.
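The permutation approach just described can be sketched in a few lines. The snippet below is a minimal illustration, not the exact procedure used in this study: it assumes any fitted classifier exposing a `predict` method and measures the mean accuracy drop after shuffling one feature column.

```python
import random

def permutation_importance(model, X, y, feature_idx, n_repeats=10, seed=0):
    """Estimate a feature's importance as the mean drop in accuracy after
    randomly permuting that feature's column (illustrative sketch)."""
    rng = random.Random(seed)
    baseline = sum(p == t for p, t in zip(model.predict(X), y)) / len(y)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature_idx] for row in X]
        rng.shuffle(col)  # break the feature-target association
        X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                  for row, v in zip(X, col)]
        acc = sum(p == t for p, t in zip(model.predict(X_perm), y)) / len(y)
        drops.append(baseline - acc)
    return sum(drops) / n_repeats
```

A feature the model actually uses yields a large accuracy drop when permuted, while an ignored feature yields a drop near zero.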
RF has become one of the most popular machine learning algorithms based on the above properties. It has several advantages: (1) it is more accurate due to the ensemble approach; (2) it works without detailed hyperparameter tuning or principal component analysis (PCA) preprocessing; (3) it computes efficiently and quickly. These are the reasons we chose RF as our backbone algorithm. SMOTE (Synthetic Minority Oversampling Technique) is an oversampling algorithm proposed by Chawla et al. to mitigate the overfitting caused by simple duplication. The method creates new randomly generated samples between minority-class samples and their neighbors, which balances the number of instances among categories.
Cost-sensitive learning [14-16] is a training approach that considers the assigned costs of misclassification errors while training the model. This is closely related to the imbalanced dataset. We need different biases when different monitoring criteria apply in various risk management phases. The computing speed of RF is efficient, so we can easily integrate different cost biases for scenarios of interest to obtain an overall picture. In this study, we refer to a machine learning approach that considers the costs associated with different types of classification errors. In medical contexts, misclassifying specific outcomes can have significant consequences, both for patient health and for healthcare resources. Therefore, the goal of cost-sensitive learning in medical informatics is to optimize the classification model in a way that minimizes the overall cost, which may include misclassification costs, treatment costs, patient preferences and risks, and medical resource allocation. By incorporating these considerations into the learning process, cost-sensitive learning in medical informatics aims to develop more accurate and practical models that better align with real-world constraints and objectives. This can ultimately lead to improved patient outcomes and more efficient use of healthcare resources.
SHAP analysis is an extension of Shapley values, a game-theoretic method that calculates the average of a feature's marginal contributions over all coalitional combinations. Unlike the direct calculation of Shapley values, SHAPs address the additive features as a linear model. For a given feature set, SHAP values are calculated for each feature and added to a base value to produce the final prediction.
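The averaging over coalitions can be made concrete with a toy computation. The sketch below (not part of the study's pipeline) computes exact Shapley values by enumerating every coalition; the value function `value_fn` stands in for a model's output on a feature subset and is only feasible to enumerate for small feature sets.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating all coalitions.
    value_fn(frozenset_of_features) -> model output for that coalition."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                # Shapley weight: |S|! * (n - |S| - 1)! / n!
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value_fn(frozenset(S) | {f})
                              - value_fn(frozenset(S)))
        phi[f] = total
    return phi
```

For an additive value function each Shapley value equals that feature's own contribution, and the values always sum to the difference between the full-coalition output and the empty-coalition (base) output, matching the additive decomposition described above.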
Global interpretation in SHAPs provides insights into the overall importance of different features in the model. By calculating SHAP values for each feature across a large dataset, we can determine which features contribute the most to predictions on average. This helps us understand which risk factors are most influential across the entire population. Summary plots visualize the impact of each feature on the model's output across the dataset. These plots show the direction and magnitude of each feature's effect on predictions, providing a comprehensive overview of the model's behavior.
Local interpretation in SHAPs can also provide explanations for individual predictions, allowing us to understand why a particular prediction was made for a specific patient. By calculating SHAP values for each feature of a single prediction, we can see how each feature contributed to it. The plot visualizes the contribution of each feature to the difference between the model output and the average output. It provides a detailed breakdown of how each feature affects the prediction for a given instance, making it easier to understand the reasoning behind individual predictions.
Using SHAPs for both global and local interpretation in medical risk factor prediction can help clinicians and researchers gain valuable insights into how the model works, which features are most important, and why specific predictions are made. This transparency and interpretability are crucial for building trust in predictive models and for making informed decisions in clinical settings.
TreeSHAP is proposed as a variant of SHAPs for tree-based machine learning models, such as decision trees, RF, and gradient-boosted trees.
The SHAP feature importance is defined as the average of absolute SHAP values per feature over all instances. This focuses on the variation of the model output, unlike permutation feature importance, which is based on performance error. From it, we learn how the magnitude of the model output (such as a likelihood or regression value) changes as we manipulate feature values; it is unrelated to performance measures such as accuracy or loss.
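This mean-absolute-value aggregation is a one-liner; the sketch below (feature names are illustrative) makes the definition explicit given a matrix of per-instance SHAP values.

```python
def shap_feature_importance(shap_matrix, feature_names):
    """SHAP feature importance: the mean of absolute SHAP values per
    feature over all instances (rows = instances, columns = features)."""
    n = len(shap_matrix)
    return {name: sum(abs(row[j]) for row in shap_matrix) / n
            for j, name in enumerate(feature_names)}
```

Note that a feature whose SHAP values are large but of mixed sign still receives a high importance, since the absolute value keeps opposite-direction effects from canceling.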

Dataset Sources
In the database of the Taiwan Cancer Registry, 17 variables are recorded as clinically potential features of recurrence: (1) gender, (2) age, (3) grade/differentiation, (4) tumor size, (5) number of regional lymph nodes involved, (6) cancer stage, (7) residual tumor on the edge of the primary site, (8) radiation therapy, (9) chemotherapy, (10) BMI, (11) smoking, (12) alcohol drinking, (13) betel nut chewing, (14) CEA (carcinoembryonic antigen) test value, (15) CEA difference value, (16) Helicobacter pylori, and (17) lymphatic or vascular invasion. Therefore, the analysis aimed to identify the most critical risk features among these 17 predictors.
Our database was analyzed in accordance with the Taiwan Cancer Registry Coding Manual Long Form Revision of 2018, published and revised by the Health Promotion Administration, Ministry of Health and Welfare. All hospitals in Taiwan register cancer data using this manual to provide information to the Taiwan Cancer Registry Center for integration; the research database is therefore part of the national cancer registration database. For example, we collected detailed information on smoking (code-36) with subcodes (00-99, 00-99, and 00-99) covering current smoking status (the first subcode), smoking history (the second subcode), and time since quitting smoking (the third subcode) according to the manual. In the preliminary analysis, participants were grouped into smoking and non-smoking categories regardless of their smoking cessation status. However, subsequent analysis indicated that smoking was not a significant factor; consequently, the detailed smoking data were not included in this article.
The feature encodings and sample sizes are organized in Figure 2. The original dataset has a significant imbalance; that is, the minority category is about 20% of the majority. As mentioned above, RF and decision trees automatically select the best feature for each decision split; therefore, we did not utilize PCA for dimensionality reduction.

Data Preprocessing
To better fit our machine learning algorithms and TreeSHAP analysis, we encode all of our data into categorical variables with assigned integers. We assign larger integers to ordinal values that are positive or of higher intensity so that TreeSHAP can display them effectively using a trend chart or dependence plot. However, most features have missing or unavailable values, which were encoded as the middle or average rank to minimize possible bias in feature importance trends; that is, the moderate impact of the NA category should lie between the maximum and minimum of the ordinal values.
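This encoding scheme can be sketched as follows. The category names and NA tokens below are hypothetical illustrations, not the registry's actual codes: higher-intensity categories get larger integers, and missing values get the middle rank.

```python
def encode_ordinal(values, order, na_tokens=("NA", "", None)):
    """Encode an ordinal feature as integers, assigning larger integers
    to higher-intensity categories; missing/unavailable values receive
    the middle rank so their impact sits between the extremes."""
    rank = {cat: i for i, cat in enumerate(order)}
    middle = (len(order) - 1) // 2  # NA encoded as the middle rank
    return [middle if v in na_tokens else rank[v] for v in values]
```

With this mapping a dependence plot reads left-to-right as increasing intensity, and the NA category cannot drag the importance trend toward either extreme.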

Dataset Balancing
With a significantly imbalanced dataset, the model tends to overfit the majority categories and ignore the learning features of minorities. Generally, several approaches can overcome this learning bias through the design of the target loss function, such as assigning different weights to samples or categories in the loss function, assigning additional cost weights to prediction errors, or resampling the data by over- or undersampling.
SMOTE is the oversampling method used in this study. Instead of simply duplicating samples, we generate synthetic minority samples up to the quantity of the majority. The method selects nearby samples around a base minority sample, randomly chooses one neighbor, and randomly interpolates the features within the distance between them.
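The interpolation step can be sketched in pure Python. This is a minimal illustration of the standard SMOTE mechanics, not the exact implementation used in the study: pick a base minority sample, find its k nearest minority neighbors, choose one at random, and interpolate a random fraction of the way toward it.

```python
import random

def smote_sample(minority, k=3, seed=0):
    """Generate one synthetic minority sample via SMOTE-style
    interpolation between a base sample and a random near neighbor."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    # k nearest minority-class neighbors of the base sample
    neighbors = sorted((s for s in minority if s is not base),
                       key=lambda s: dist(s, base))[:k]
    nb = rng.choice(neighbors)
    lam = rng.random()  # random fraction of the distance toward nb
    return [x + lam * (y - x) for x, y in zip(base, nb)]
```

Because each synthetic point lies on the segment between two real minority samples, it stays inside the minority region instead of merely duplicating an existing point.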


10-Fold Cross-Validation
In this study, 10-fold cross-validation is used for performance checks to prevent bias from a single split of the whole dataset. The complete dataset is randomly partitioned into ten equal-sized subgroups. Each time, one subgroup is chosen as the holdout set, and the remaining nine subgroups are used for training. At the end of 10 training rounds, the average performance of the 10 models is output. We then aggregate the various ordered feature importance lists based on the ranks with additional weights via the cross-entropy Monte Carlo algorithm using the Spearman distance. For each classifier, we perform upsampling to alleviate the class imbalance problem after inputting the training data. The classification metrics of the Random Forest algorithm, such as TP rate, FP rate, Precision, Recall, F1 score, and Accuracy, are illustrated in Table 1 below. We then benchmarked RF against MLP, C4.5, AdaBoost, and Bagging as the machine learning algorithms.
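The fold construction described above can be sketched as follows (a generic k-fold partition, not this study's exact random seed or tooling):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Randomly partition n sample indices into k near-equal folds;
    each fold serves once as the holdout set while the remaining
    k-1 folds form the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin split
    splits = []
    for i in range(k):
        holdout = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, holdout))
    return splits
```

Averaging the metric over the k holdout sets gives the reported cross-validated performance, and every sample is held out exactly once.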

1. MLP: a classifier that uses backpropagation to learn a Multilayer Perceptron to classify instances.
2. C4.5: this algorithm develops a decision tree by splitting on the value of a feature at each node, including categorical and numeric features. We calculated the information gain and used the feature with the highest gain as the splitting rule.
3. AdaBoost with C4.5: part of the group of ensemble methods called boosting; it adds newly trained models in a series where subsequent models focus on fixing the prediction errors made by previous models. In this study, we selected C4.5 as the base classifier.
4. Bagging (Bootstrap Aggregation) with C4.5: an ensemble technique that uses the bootstrap sampling method to form different sets of samples with replacement. We used C4.5 as the base classifier to derive the forest.
The RF classification model develops parallel decision trees that vote on the category of a given instance and outputs the final decision as the prediction. Cost-sensitive learning is essential for risk management as we pursue better flexibility in trade-offs among metrics. For example, we may focus more on the recall rate of the recurrence category by loosening the precision rate. In this part, we assign different costs of 1, 2, 3, and 5 to the false negative error of the recurrence category but keep the false positive error cost at 1.
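One simple way such asymmetric costs change predictions is through the decision rule; the sketch below is an expected-cost argument for illustration, not the study's exact cost-sensitive training setup: an instance is labeled recurrent when the expected cost of a false negative exceeds that of a false positive, which lowers the probability threshold as the FN cost grows.

```python
def cost_sensitive_predict(probs, fn_cost, fp_cost=1.0):
    """Label an instance positive (recurrent) when the expected
    false-negative cost p * fn_cost exceeds the expected
    false-positive cost (1 - p) * fp_cost; equivalent to the
    threshold fp_cost / (fp_cost + fn_cost)."""
    thr = fp_cost / (fp_cost + fn_cost)
    return [int(p > thr) for p in probs]
```

Raising `fn_cost` from 1 to 5 drops the threshold from 0.5 to about 0.17, so borderline patients are flagged as recurrent; recall on the recurrence category can only rise, at the expense of precision, mirroring the trade-off reported in Table 4.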
In our study, we set the recurrence category as positive and then evaluated the metrics. True Positive (TP): number of positive instances predicted as positive. True Negative (TN): number of negative instances predicted as negative. False Positive (FP): number of negative instances predicted as positive. False Negative (FN): number of positive instances predicted as negative. Accuracy: (TP + TN)/(TP + TN + FP + FN). Precision: TP/(TP + FP). Recall: TP/(TP + FN). F1-score: 2 × (Recall × Precision)/(Recall + Precision). False Positive Rate: FP/(FP + TN). True Positive Rate: TP/(TP + FN). The ROC curve (receiver operating characteristic curve) uses the True Positive Rate as the y-axis and the False Positive Rate as the x-axis, plotting points at corresponding thresholds.
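The definitions above translate directly into code; the counts in the usage note are arbitrary example values, not the study's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics defined above from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * recall * precision / (recall + precision),
        "fpr": fp / (fp + tn),   # false positive rate (ROC x-axis)
        "tpr": tp / (tp + fn),   # true positive rate (ROC y-axis)
    }
```

For example, with TP = 9, TN = 7, FP = 1, and FN = 3, accuracy is 0.8, precision 0.9, recall 0.75, and the false positive rate 0.125.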

Interpretability in Machine Learning Models
We review feature importance using two approaches, RF and SHAP. The former is a single-feature permutation approach that observes the impact on model performance, whereas the latter is flexible for observing main features and interaction effects with respect to model output.
Concerning SHAP, we use global interpretability plots. The feature importance plot lists all features in order of importance and uses color bars indicating positive or negative correlation coefficients. The bee swarm plot for feature importance presents the SHAP values and output impact directions of individual points in rich colors, helping users obtain critical insights quickly. The dependence plot helpfully shows the correlations and interactions between two features and the SHAP value trends.
The local interpretability plot, the waterfall plot, is designed to demonstrate explanations for individual instances.

Traditional Predictor Algorithms
Before we established our prediction model, we analyzed these clinical features using a traditional statistical method, as shown in the table below. Finally, we tried to establish a cost-effective prediction model for recurrent gastric cancer.

Prediction Performance
The balanced dataset has 2044 instances of the original majority category and 2044 instances of the oversampled minority category. The relevant risk factors and a statistical chi-square test were performed as a preliminary survey. In total, 17 clinical features are statistically significant variables: gender, age, grade/differentiation, tumor size, number of regional lymph nodes involved, cancer stage, residual tumor on the edge of the primary site, radiation therapy, chemotherapy, BMI, smoking, alcohol drinking, betel nut chewing, CEA (carcinoembryonic antigen) test value, CEA difference value, Helicobacter pylori, and lymphatic or vascular invasion. In this study, 10-fold cross-validation was used to prevent biases from splitting the entire dataset. It randomly divides the entire dataset into ten subgroups of the same size. Each time, a subgroup is selected as the holdout set, and the remaining nine subgroups are used for training. At the end of 10 training rounds, the average performance of the 10 models is produced. Then, we aggregate the various ordered lists of important features based on the ranks with additional weights via the Monte Carlo cross-entropy algorithm using the Spearman distance. For each classifier, we performed upsampling to alleviate the class imbalance problem after inputting the training data. Table 3 summarizes the comparison of different algorithms, including MLP, C4.5, AdaBoost with C4.5, Bagging with C4.5, and RF. RF outperforms on the metrics, with an F1 of 88.2%, an ROC area of 95.2%, a PRC area of 95.4%, and an accuracy of 87.9% for the recurrence category. First, MLP shows better metric performance than C4.5 as a benchmark between the numerical-based algorithm and the categorical-based decision tree algorithm. Second, from the evidence that RF significantly surpasses the C4.5 and Bagging sets, we believe that the main improvement comes from the randomness design of bootstrap subsampling and the random choice of features for splitting nodes. Third, compared with AdaBoost, the independent
trees of RF show better ensemble synergy than the boosting ensemble, which arranges the trees of the forest in series. From the comparison of the ROC curves in Figure 3a, RF has the largest area, 0.952, which means that its trade-off between TPR and FPR through threshold setting is relatively better than the others, while C4.5 is the worst. As a result, we chose Random Forest to establish our prediction model. In Table 4, we can see that the recall rate increases (from 90.5% to 96.1%) if we increase the penalty cost on the FN (false negative) error of the recurrence category while keeping the FP error cost at one; the precision rate (from 86% to 74.2%) and the overall accuracy (from 87.9% to 81.4%) are compromised accordingly. In addition, the performance metrics of the Random Forest algorithm with different costs are compared and listed in Table 4 below, including the TP rate, FP rate, Precision, Recall, F1 score, ROC area, PRC area, MSE (mean square error), and Accuracy. This policy scenario means that we hope potential recurrence patients are labeled as much as possible, with an acceptable level of FP (false positive) misclassification. Figure 3b also shows no noticeable difference between the different costs.

Interpretability
Before diving into the importance of features, we need to understand that model interpretability is not always equal to causality. It is essential to stress that SHAP values do not provide causality but instead provide insights into how the model behaves based on what it learned from the data.
First, we used RF to observe global characteristics; Figure 4a shows that F-6/F-5/F-2 are the top three impact features. However, these feature importances do not show a positive or negative impact on recurrent gastric cancer. Therefore, SHAP was applied, and its values show more information in Figure 4b. Firstly, it agrees that F-6 and F-5 have a critical and positive impact, meaning that higher feature values bring a bigger recurrence probability; meanwhile, F-2 falls to sixth with a negative impact. Moreover, F-16 moves up to the third feature with a negative impact. The red bars in Figure 4b indicate a positive correlation coefficient. Next, we further investigated the SHAP distribution of all instances in the bee swarm plot in Figure 5a; for F-6 and F-5, instances with higher feature values (red points) mostly contribute positive SHAP values. On the contrary, for negatively correlated features, the lower feature values (blue points) contribute positive SHAP values.
If we want to investigate the interactions between features further, the breakdown of the dependence plot in Figure 5b shows the SHAP values in relation to the main effects and interaction effects. The on-diagonal values are the main effects, whereas the off-diagonal values are interactions. From the upper left corner of Figure 5b, we can see that the interaction between F-5 and F-6 brings a negative SHAP value. In other words, F-5 and F-6 each have positive main SHAP values on the diagonal, but the off-diagonal interaction value somewhat decreases the main values.
Global interpretation shows correlations between features and predictions across all samples, which cannot clearly explain the prediction for a specific sample. In our study, we found competitive performance of the prediction model using the top 10 clinical features.
For local interpretation in SHAPs, we can provide explanations for individual predictions, for the recurrent case in Figure 6a and the non-recurrent case in Figure 6b, allowing us to understand why a particular prediction was made for a specific patient. Figure 6a locally indicates the breakdown of the individual SHAP values of the different clinical features of the recurrent case; from that, we can clearly see how those feature forces push the prediction probability f(x) up toward 1. In contrast, Figure 6b shows the reverse forces of the different clinical features of the non-recurrent case bringing f(x) down toward 0. Most importantly, in everyday work, we can analyze individual patients using RF and local SHAP analysis, as shown in Figure 6, to understand the impacts of all clinical features.
As a result, we established a cost-effective prediction model for recurrent gastric cancer using 8 of 17 clinical features, and the model has competitive efficiency and performance. In real-world data, especially for malignancies, numerous clinical features need to be evaluated in relation to cancer recurrence. With this method, several leading clinical features can be used to establish a reliable prediction model with the same performance, instead of using all clinical features.

Discussion
More than 60% of gastric cancer survivors experience recurrence after curative resection, especially within two years after surgery. Different clinical or pathological risk factors are thought to lead to different patterns of gastric cancer recurrence [17]. However, how to predict the recurrence of gastric cancer remains an open issue.
Lo et al. [18] found that the most critical risk factors for recurrence in early gastric cancer are lymph node status and the size of the mucosal tumor. Compared to advanced gastric cancer, the prognosis for patients with early gastric cancer is excellent. In our study, the positive SHAP values of F-4 (tumor size), F-5 (number of regional lymph nodes involved), and F-6 (cancer stage) are consistent with their report.
In 2009, Tokunaga et al. [19] reported that the 5-year survival rate after curative gastrectomy is better in overweight patients than in non-overweight patients. Being overweight was shown to be an independent prognostic factor in patients with early gastric cancer, although the reason has not yet been determined. In our prediction model, F-10 (BMI) is the fourth leading feature, with a negative impact on the recurrence of gastric cancer; our results were similar.
In 2020, Zheng et al. [20] identified that the pathological tumor (pT) stage and the pathological nodal (pN) stage were significantly associated with the prognosis of stage I gastric cancer. Postoperative adjuvant chemotherapy might help improve the outcomes of high-risk patients. In our study, more clinical features were collected and divided into subgroups for analysis. F-6 (cancer stage) is the leading risk factor, followed by F-5 (number of regional lymph nodes involved). We reached a similar conclusion.
Helicobacter pylori, which is not only the leading risk factor for gastric cancer but is also associated with a high risk of recurrence, colonizes the gastric mucosa and induces persistent chronic gastric inflammation [21,22]. Patients with a high-IgG1 genotype have a higher risk of recurrence than patients with other genotypes. Our study demonstrated that SSF3 (Helicobacter pylori) is the third leading feature for the recurrence of gastric cancer.
Artificial intelligence has become a workhorse for cancer diagnosis and prognosis with unprecedented accuracy and is more powerful than traditional statistical analysis [23]. Chang et al. used a stacked ensemble-based classification model to predict second primary cancers in head and neck cancer survivors from clinical features [24]. Using artificial intelligence, it is possible to develop a prediction model that identifies clinical risk features and helps clinicians screen cancer survivors before recurrence occurs.
In this study, we explored data mining with the machine learning algorithm RF on an imbalanced dataset. We utilized SMOTE to oversample the minority class, balancing the data and preventing the model from being biased toward the original majority of non-recurrent instances. Regarding metric performance, RF showed better prediction capability than MLP, C4.5, AdaBoost, and Bagging, reaching an overall accuracy of 87.9%, a recall rate of 90.5%, a precision rate of 86%, and an F1 of 88.2% in the recurrent category using 10-fold cross-validation on the balanced dataset.
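The core of SMOTE is interpolation between a minority sample and one of its k nearest minority neighbors. The sketch below is a minimal, assumed simplification of the technique (real work would typically use imblearn's SMOTE implementation), shown here only to make the oversampling step concrete:

```python
# A minimal SMOTE sketch: each synthetic minority sample is interpolated
# between a randomly chosen minority point and one of its k nearest
# minority-class neighbors. Illustrative only, not the study's pipeline.
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances among minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                # exclude self-distance
    neighbors = np.argsort(d, axis=1)[:, :k]   # k nearest minority neighbors

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))           # random minority sample
        nb = neighbors[j, rng.integers(k)]     # one of its k neighbors
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic[i] = X_min[j] + gap * (X_min[nb] - X_min[j])
    return synthetic

# Oversample a toy minority class of 20 points with 80 synthetic samples
X_minority = np.random.default_rng(1).normal(size=(20, 4))
X_new = smote(X_minority, n_new=80)
```

Because each synthetic point lies on a segment between two real minority points, SMOTE densifies the minority region rather than duplicating records, which is what keeps the balanced classifier from simply memorizing repeated instances.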
Regarding cost-sensitive learning, as we increased the cost of a false negative (FN), the recall rate improved from 90.5% to 96.1%. Meanwhile, the precision rate was compromised from 86% to 74.2%, and the accuracy from 87.9% to 82.4%. Cost-sensitive learning is a quick way in conventional machine learning to adjust risk assessment when we intend to switch between different screening policies.
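One common way to realize this trade-off (an assumed stand-in for the study's cost matrix, sketched on synthetic data) is to weight the minority class by the FN cost when fitting the forest, which shifts the classifier toward higher recall at the expense of precision:

```python
# A sketch of cost-sensitive learning via class weights: raising the penalty
# on missing a positive (a false negative) pushes the Random Forest toward
# higher recall, trading away some precision. Synthetic data, illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for fn_cost in (1, 5, 10):                       # cost of missing a recurrence
    clf = RandomForestClassifier(
        n_estimators=100,
        class_weight={0: 1, 1: fn_cost},         # penalize FN fn_cost times more
        random_state=0,
    ).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(fn_cost, recall_score(y_te, pred), precision_score(y_te, pred))
```

Sweeping the FN cost like this is what lets a deployment switch between screening policies, e.g., a high-recall setting for aggressive surveillance versus a balanced setting for routine follow-up.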
From the SHAP value interpretation analysis, we gained insights into how the model makes decisions based on features and on the interactions between correlated features. The identified top-five features are F-6 (cancer stage), F-5 (number of regional lymph nodes involved), SSF3 (Helicobacter pylori), BMI, and gender. However, remember that model interpretation analyzes model behavior and is not equivalent to causal effect analysis: the model learns from complicated correlations within the dataset, as shown in Figure 5b, and looks for the best choice to optimize an objective function.
In this study, we aimed to establish a cost-effective prediction model using fewer clinical features. Although global interpretation shows correlations between features and predictions across all samples, it cannot clearly explain the prediction for a specific sample. Nevertheless, we found that equal performance could be achieved by our prediction model with SHAPs using only eight clinical features. With RF alone, we could determine the importance of these clinical features but not the direction of their effects; through SHAP values, we could see whether each clinical feature has a positive or negative impact. Finally, we were able to identify the contributory risk factors of recurrent gastric cancer.
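The cost-effective idea, rank features by importance, keep the leading ones, and check that performance holds, can be sketched as follows on synthetic data (feature counts mirror the study's 17 features, but the data and threshold of 8 are illustrative assumptions):

```python
# A sketch of the feature-reduction step: rank features by Random Forest
# importance, keep the top k, and verify the reduced model stays competitive
# under 10-fold cross-validation. Synthetic data, not the registry dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=17, n_informative=6,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]   # most important first

top8 = ranking[:8]                                    # keep 8 of 17 features
acc_full = cross_val_score(rf, X, y, cv=10).mean()
acc_top8 = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X[:, top8], y, cv=10,
).mean()
```

If `acc_top8` stays close to `acc_full`, the discarded features carried little additional signal, which is exactly the argument for collecting only the leading clinical features in routine screening.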
As a result, we established a cost-effective prediction model for recurrent gastric cancer using 8 of 17 clinical features, and the model has competitive efficiency and performance. In real-world data, especially for malignancies, numerous clinical features need to be evaluated in relation to cancer recurrence. With this method, we were able to use several leading clinical features to establish a reliable prediction model with the same performance, instead of using all clinical features.
This study has some limitations. First, the data from the Taiwan Cancer Registry were collected from different hospitals and time periods, and some features, such as lymphatic or vascular invasion, have a significant number of missing values. This may bias real trends if such a feature contributes a significant SHAP value, so we must pay attention when interpreting it. Second, although 17 clinical features were included, some risk factors, such as surgical procedure, neoadjuvant regimen, race, and duration of smoking, could not be assessed using this dataset.

Conclusions
In conclusion, RF was the best classifier in terms of prediction capability in our study. RF with SMOTE on this dataset reached outstanding performance, SHAPs provided good interpretation and feature importance, and we identified the top-five risk factors. Cost-sensitive learning improved the recall rate from 90.5% to 96.1%, with the precision rate compromised from 86% to 74.2% and the accuracy from 87.9% to 82.4%. Most notably, F-6 (cancer stage), F-5 (number of regional lymph nodes involved), SSF3 (Helicobacter pylori), BMI, and gender were the leading impact factors for recurrent gastric cancer. They will help physicians detect high-risk patients early among gastric cancer survivors.
This study used data from the Taiwan Cancer Registry Database to retrospectively evaluate clinical features of gastric cancer at diagnosis and their association with gastric cancer recurrence after the launch of targeted therapies. Despite its limitations, this study should provide an essential basis for further research.

Figure 1. The proposed process flow diagram of this study.

Figure 2. The dataset features with encodings and sample sizes used in our prediction analysis.

Figure 4. The clinical features ranked by feature importance: (a) Random Forest feature importance; (b) SHAP values with the cost of FN = 1 (red: positive impact; blue: negative impact).

Figure 5. (a) Bee swarm plot. (b) Dependence plot, including the SHAP interaction matrix between the top 10 clinical features.

Figure 6. Waterfall plot. Two examples of local interpretation: (a) the recurrent case; (b) the non-recurrent case. f(x) is the expectation (that is, the average prediction over all instances).

Table 1. Classification metrics of Random Forest derived from the confusion matrix.

Table 2. Traditional analysis of clinical features of recurrent gastric cancer. * Statistically significant (p < 0.05). ** Statistically highly significant (p < 0.001). NA: not applicable.

Table 3. Comparison of classification performance for different algorithms.

Table 4. Comparison of Random Forest classification performance for different costs.