A Study Investigating Interpretable Deep Learning Models for Predicting Mortality and Survival in Patients with Primary Thyroid Lymphomas

Yu, Zihan; Hu, Rong; Chen, Jiaqing

doi:10.3390/app15095146

by Zihan Yu ¹, Rong Hu ¹,* and Jiaqing Chen ¹,²,*

¹ College of Mathematics and Statistics, Wuhan University of Technology, Wuhan 430070, China
² Hubei Longzhong Laboratory, Wuhan University of Technology, Xiangyang 441100, China
* Authors to whom correspondence should be addressed.

Appl. Sci. 2025, 15(9), 5146; https://doi.org/10.3390/app15095146

Submission received: 29 March 2025 / Revised: 29 April 2025 / Accepted: 3 May 2025 / Published: 6 May 2025

Abstract

Primary thyroid lymphoma (PTL) is a rare malignancy. This study aimed to develop a prognostic prediction model for PTL using deep learning algorithms while providing interpretable analyses. Machine learning models were employed for mortality risk prediction, with the SHAP framework introduced for feature interpretation, and a DeepSurv model was constructed for comparison with the Cox proportional hazards (Cox-PH) model. Model performance was evaluated using Harrell’s c-index, ROC curves, AUC, calibration curves, and decision curve analysis (DCA). SHAP analysis identified age, ‘B’ symptoms, histological type, and marital status as the most influential factors affecting patient mortality risk, and the DeepSurv model outperformed the Cox-PH model on the validation set (c-index 0.758 vs. 0.739 for overall survival and 0.789 vs. 0.779 for cancer-specific survival). Additionally, a web application platform was developed based on the DeepSurv model to predict the 5-year survival rate of PTL patients, facilitating the transition from theoretical research to clinical application. This study highlights the potential of deep learning models, particularly DeepSurv, in improving prognostic predictions for PTL and provides a practical tool for guiding clinical treatment decisions. The findings underscore the value of integrating interpretable machine learning frameworks into survival analysis for rare cancers.

Keywords: primary thyroid lymphoma; survival analysis; Cox-PH; DeepSurv

1. Introduction

Primary thyroid lymphoma (PTL) is a rare malignant lymphoma that affects the thyroid gland. It represents between 2% and 5% of all thyroid malignancies [1,2] and between 2% and 3% of all extranodal lymphomas. Victoria et al. reported that diffuse large B-cell lymphoma (DLBCL) is the most common histological type, accounting for 68% of PTL cases. This is followed by follicular lymphoma (10%), marginal zone or mucosa-associated lymphoid tissue (MALT) lymphoma (10%), and small lymphocytic lymphoma (3%) [3].

According to Victoria et al., the median survival of PTL patients was 11.6 years. The study population was predominantly female (68%) and white (93%), with a median age of 65.8 years. Previous studies have identified Myc/Bcl-2 protein co-expression, treatment modality, and rituximab as independent prognostic factors [4].

The primary symptom of PTL is neck swelling accompanied by enlarged cervical lymph nodes, which typically results in obstructive symptoms [5]. In addition, approximately 10% of patients may experience Type B symptoms, such as fever and weight loss [6]. PTL is typically diagnosed through surgical biopsy or ultrasound-guided needle biopsy [1,4].

PTL is typically treated with a combination of chemotherapy and radiotherapy due to the potential complications and incomplete resection of residual tissue that may result from surgery [1,7,8]. Only a few cases that require substantial reduction of the tumor are treated with surgery. Treatment that includes both chemotherapy and radiotherapy is commonly known as Combined-Modality Therapy (CMT). Studies have shown that CMT can significantly reduce distant recurrence [9]. Chemotherapy typically involves the use of CHOP in combination with rituximab [1,3]. For patients experiencing airway obstruction, tracheotomy or tracheal stent implantation is typically performed [8].

In survival analysis studies in the medical field, the time-to-event variable is the core response variable. It is used to assess both the likelihood that an individual experiences a target clinical event and the time at which the event occurs. When the target event is not observed in a study subject because of factors unrelated to the study, the data for that individual are considered censored [10]. The target event can be any clinically meaningful and observable endpoint. In this study, the target event is the death of the patient due to cancer.

From a statistical perspective, the time-to-event variable is modeled as a random variable and analyzed with conventional estimation methods, including derivation of the cumulative distribution function and the hazard (risk) function. The hazard function describes the instantaneous rate at which the target event occurs within a brief period following a specified time point, conditional on the event not having occurred before that time. These methods form the foundation for statistical inference on time-to-event variables and their distributional characteristics within a specified population.
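For reference, the standard textbook definitions behind this paragraph can be written as follows. The notation is an assumption introduced here for illustration and does not appear in the original article: T denotes the time-to-event variable, f its density, F its cumulative distribution function, S the survival function, and h the hazard function.

```latex
\[
F(t) = P(T \le t), \qquad S(t) = 1 - F(t) = P(T > t),
\]
\[
h(t) = \lim_{\Delta t \to 0^{+}}
       \frac{P\left(t \le T < t + \Delta t \mid T \ge t\right)}{\Delta t}
     = \frac{f(t)}{S(t)} .
\]
```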

Machine learning (ML) techniques have great potential in thyroid cancer research, providing new solutions for diagnosis, metastasis prediction, prognosis, and treatment personalization, which may significantly change the treatment of thyroid cancer [11]. A range of ML models has been applied to datasets containing clinical, biochemical, and ultrasound-derived features.

With the rising incidence of thyroid cancer, determining whether a detected tumor is malignant or benign is becoming a major challenge. Olatunji et al. [12] used ML techniques for the early detection of thyroid cancer in the pre-symptomatic stage. The dataset they used was collected from King Fahd Specialized Hospital in Dammam, Saudi Arabia, and Random Forest achieved the highest prediction accuracy (90.91%). Luong et al. [13] investigated the application of ML to predicting malignancy in indeterminate thyroid nodules using non-invasive assay data, and Random Forest again performed well. Previous studies have rarely used optimization algorithms to tune machine learning hyperparameters. Książek [14] applied the naked mole-rat algorithm for the first time to optimize the parameters of a machine learning model and perform feature selection. The results show that the naked mole-rat algorithm has considerable potential to improve the accuracy of machine learning models, which also provides ideas for subsequent research.

Patients with differentiated thyroid cancer can experience recurrence even after initial treatment, and identifying individuals at high risk of recurrence is crucial for optimal patient management. Schindele et al. [15] combined XGBoost and SHAP to predict the probability of recurrence, which improved the identification of patients at high risk of tumor recurrence. Atay et al. [16] demonstrated the effectiveness of regularized class association rules (RCAR) in predicting thyroid cancer recurrence.

Yang et al. [17] introduced a new machine learning approach to create a prognostic system for cancer patients, which was significantly more accurate in predicting survival than the AJCC staging scheme. Barfejani et al. [18] evaluated the performance of five ML algorithms in predicting short-term survival in patients with anaplastic thyroid cancer (ATC) and found that these models could potentially guide clinical decisions and individualized treatment strategies.

The utilization of machine learning in healthcare has the potential to enhance the efficacy and precision of medical diagnosis. However, deep learning and machine learning models are typically regarded as opaque, with limited interpretability, making it challenging to comprehend the influence of features on model predictions, which is crucial for disease diagnosis [19]. Consequently, the SHAP approach was employed to elucidate the variables that influence the mortality risk in patients with primary thyroid lymphoma.

Machine learning (ML) has already been used to predict the overall survival (OS) and cancer-specific survival (CSS) of PTL patients [4,20,21]. However, these models depend on the traditional Cox proportional hazards (Cox-PH) model, which requires manual feature selection and considerable time and effort. Moreover, Cox-PH is a semiparametric model that assumes a linear relationship between the covariates and the log hazard [22], whereas the relationship between covariates and survival outcomes is generally nonlinear. Therefore, the aim of this study is to assess the feasibility of using the deep learning (DL) model DeepSurv for PTL survival analysis based on the Surveillance, Epidemiology, and End Results (SEER) dataset. With the DeepSurv model, physicians can assess the survival probability of patients and adjust treatment measures in a timely manner. In addition, through a web-based application platform, doctors can input patients’ clinical data in real time, and the model automatically generates personalized survival predictions, providing more comprehensive support for precision medicine.

2. Materials and Methods

2.1. Data Collection

The study data were obtained from 17 cancer registries in the SEER database (version 8.4.3) for the years 2004 to 2015. PTL was identified using the third edition of the International Classification of Diseases for Oncology (ICD-O-3), with C73.9 as the primary site code for the thyroid. The included histologies were lymphoma, not otherwise specified (NOS) (9590–9591), composite-site Hodgkin’s lymphoma (9596), small lymphocytic B cell (9670–9671), mantle cell lymphoma (9673), mixed diffuse B cell (9675), thymic large B cell lymphoma (9679), DLBCL (9680–9684), Burkitt’s lymphoma (9687), follicular lymphoma (9690–9698), marginal zone lymphoma (9699), and T-cell lymphoma (9702–9714). Patients with only autopsy records or only death certificates, as well as those with unknown survival time, were excluded from the study. The variables collected from the SEER database for analysis included age, sex, race, income, marital status, region, histological subtype, distant metastasis surgery, pathological grade, AJCC stage, surgery, tumor metastasis, radiotherapy records, order of surgery and radiotherapy, chemotherapy records, systemic ‘B’ symptoms at diagnosis, number of malignant tumors, degree of tumor invasion, survival status, state, and survival time. The data selection process is shown in Figure 1.

2.2. Data Preprocessing

A total of 1184 patients were enrolled in the study. The patients were randomly assigned to the training set (n = 832) and the validation set (n = 352) in a 7:3 ratio. Categorical variables were coded in an unordered manner, while continuous variables were retained in their original form. Missing values were filled using Random Forest.

2.3. Method of Feature Selection

To remove noise and irrelevant features and to improve the accuracy of the model, three feature screening methods based on different principles were employed: univariate and multivariate Cox regression analysis, recursive feature elimination, and Boruta feature selection. Features selected by two or more of the methods were retained as the final features.
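The "two or more methods" rule described above can be sketched as a simple vote count. This is a minimal illustration, not the authors' code: the function name `vote_features` and the example feature lists are hypothetical and do not reproduce the paper's Table 2.

```python
from collections import Counter

def vote_features(selections, min_votes=2):
    """Keep features chosen by at least `min_votes` of the selection methods."""
    # set() ensures a method cannot vote twice for the same feature.
    counts = Counter(f for chosen in selections.values() for f in set(chosen))
    return sorted(f for f, c in counts.items() if c >= min_votes)

# Hypothetical selections, for illustration only (not the paper's results).
selections = {
    "cox": ["age", "radiotherapy", "chemotherapy", "b_symptoms"],
    "rfe": ["age", "histology", "b_symptoms", "surgery"],
    "boruta": ["age", "histology", "marital_status", "chemotherapy"],
}
selected = vote_features(selections)
print(selected)
```

Requiring agreement between methods built on different principles trades a smaller feature set for less dependence on the quirks of any single selector.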

2.4. The Treatment of Category Imbalance

In the context of binary classification, category imbalance refers to a significant disparity in the number of samples between two categories. However, in the case of medical data, it is more challenging to achieve perfectly balanced data directly. In order to enhance the predictive accuracy of the model, it is necessary to process the imbalanced data before modeling. The SMOTE algorithm was employed for oversampling, the ClusterCentroids algorithm for downsampling, and the SMOTE Tomek link algorithm for combinatorial sampling. The data were processed using these algorithms, and the model was trained on the processed data. The sampling technique that yielded the best results was selected for mortality risk prediction and survival analysis.
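To make the oversampling idea concrete, the following is a minimal NumPy sketch of the interpolation step at the heart of SMOTE. It is an illustration under stated assumptions, not the implementation used in the study: the function name `smote_sketch` is hypothetical, only continuous features are handled, and the Tomek link cleaning step of the combined method is not shown.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)      # starting sample for each new point
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # neighbour to move toward
    gap = rng.random((n_new, 1))               # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two existing minority samples, the new data stay inside the region already occupied by the minority class.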

2.5. Development of Machine Learning Model

As a medical aid, machine learning algorithms can assist doctors in making diagnoses that are more accurate and patient-friendly. However, classification algorithms differ markedly in performance across different types of datasets, so the most appropriate model should be selected based on the characteristics of the data for a specific disease type. This paper therefore compares models of mortality risk in primary thyroid lymphoma patients based on the Support Vector Machine (SVM), Logistic Regression, Random Forest, Gradient Boosted Tree (GBDT), and Extreme Gradient Boosting (XGBoost) algorithms; selects the most effective model for mortality risk prediction; and performs interpretability analysis using SHAP. Python (version 3.9) was used with the following packages: numpy, scikit-learn, pandas, xgboost, shap, and matplotlib.

To optimize the performance of the machine learning model, we performed hyperparameter tuning using cross-validation on the training set. The following hyperparameters were tuned:

(1) Support Vector Machine (SVM)

Kernel function: linear;
Error convergence condition: 0.001.

(2) Logistic Regression

Error convergence condition: 0.001.

(3) Random Forest

Maximum depth: 10;
Maximum number of leaf nodes: 50.

(4) Gradient Boosted Tree (GBDT)

Learning rate: 0.1;
Loss function: deviance.

(5) Extreme Gradient Boosting Algorithm (XGBoost)

Learning rate: 0.1;
Maximum depth: 10.
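As a rough guide, the settings listed above might map onto scikit-learn estimators as in the sketch below. This mapping is an assumption: the article does not give the exact parameter names, and "error convergence condition" is interpreted here as the `tol` argument. All other parameters are left at the library defaults.

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    "SVM": SVC(kernel="linear", tol=1e-3),
    "Logistic Regression": LogisticRegression(tol=1e-3),
    "Random Forest": RandomForestClassifier(max_depth=10, max_leaf_nodes=50),
    # The binomial deviance loss is called "log_loss" in recent scikit-learn
    # releases ("deviance" in older ones). It is the default, so it is left unset.
    "GBDT": GradientBoostingClassifier(learning_rate=0.1),
}
# XGBoost would be configured analogously if the package is installed, e.g.
# xgboost.XGBClassifier(learning_rate=0.1, max_depth=10).

for name, model in models.items():
    print(name, type(model).__name__)
```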

2.6. Development of the Cox-PH Model

To identify significant features, univariate and multivariate Cox regression analyses were performed. Variables with a p-value < 0.05 in univariate Cox regression analysis were included in multivariate Cox regression analysis. The Cox proportional hazards models were constructed using R software (version 4.3.1). The packages used were caret, survival, plyr, MASS, rms, ggplot2, riskRegression, ggDCA, ggprism, and forestplot.

2.7. Development of the DeepSurv Model

The DeepSurv model combines a neural network with the Cox proportional hazards model to predict an individual’s survival time or the probability of an event more accurately by learning nonlinear relationships between the covariates and risk. A 5-layer DeepSurv neural network was constructed using the training set and evaluated using the validation set, as shown in Figure 2. Python (version 3.9) was used to build the model, with the packages numpy, matplotlib, pandas, seaborn, sklearn, and platform.

Overall, overfitting tends to be negatively correlated with the learning rate, decay rate, and L2 regularization, while the structure of the neural network, the activation function, and dropout also have a large impact on overfitting. The following hyperparameters were therefore optimized:

Learning rate: 0.067;
Decay rate: 6.494 × 10⁻⁴;
L2: 0.001;
Dropout: 0.147.
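DeepSurv, as described by Katzman et al., trains the network by minimizing the negative Cox partial log-likelihood of its output. The sketch below shows that loss in plain NumPy for a vector of predicted log-risk scores. It is a simplified illustration rather than the authors' training code: the function name is hypothetical, tied event times receive no special correction, and the L2 penalty is omitted.

```python
import numpy as np

def neg_log_partial_likelihood(risk, time, event):
    """Average negative Cox partial log-likelihood.

    risk  : predicted log-risk per patient (network output in DeepSurv,
            linear predictor in Cox-PH)
    time  : observed follow-up time
    event : 1 if the event was observed, 0 if censored
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    order = np.argsort(-time)                  # longest follow-up first
    risk, event = risk[order], event[order]
    # Running log-sum-exp: log of summed exp(risk) over each patient's risk set.
    log_risk_set = np.logaddexp.accumulate(risk)
    return -np.sum((risk - log_risk_set) * event) / event.sum()

time = [1.0, 2.0, 3.0]
event = [1, 1, 1]
loss_good = neg_log_partial_likelihood([2.0, 1.0, 0.0], time, event)  # high risk dies first
loss_bad = neg_log_partial_likelihood([0.0, 1.0, 2.0], time, event)   # ordering reversed
print(loss_good < loss_bad)
```

Only the ordering of risk scores relative to event times enters the loss, which is why the same objective serves both the linear Cox-PH model and a nonlinear network.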

2.8. Model Performance Evaluation

For the machine learning models, comparisons were performed using accuracy, recall, precision, and F1. For the Cox-PH and DeepSurv models, predictive performance was assessed using Harrell’s concordance index (c-index); specificity and sensitivity were assessed using receiver operating characteristic (ROC) curves and the area under the curve (AUC); accuracy was assessed using calibration curves; and clinical utility and net benefit were assessed using decision curve analysis (DCA).
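Harrell's c-index can be sketched in a few lines of plain Python. This is a minimal quadratic-time version for illustration, with a hypothetical function name and toy data; it is not the routine used in the study.

```python
def harrell_c_index(time, event, risk):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when the patient with the shorter time had
    the event. It is concordant when that patient also has the higher
    predicted risk. Tied risk scores count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if event[i] == 1 and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy data: the second patient is censored at time 4.
time = [2, 4, 6, 8]
event = [1, 0, 1, 1]
risk = [0.9, 0.3, 0.1, 0.5]
c_index = harrell_c_index(time, event, risk)
print(c_index)  # 0.75
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk.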

2.9. Statistical Analysis

Continuous variables are reported as means and standard deviations, while categorical variables are reported in terms of frequency and percent. The Student t-test was used for the comparison of continuous variables, and the chi-squared test was used for the comparison of categorical variables. Kaplan–Meier (K-M) survival analysis and the log-rank test were used to assess the prognostic effect of the independent factors identified by multivariate Cox regression analysis. The calculations and analyses were carried out using R software (version 4.3.1).
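The Kaplan–Meier estimator mentioned above multiplies, at each observed event time, the fraction of at-risk patients who survive that time. A minimal NumPy sketch with toy data follows; the study itself used R, so this is illustrative only and the function name is hypothetical.

```python
import numpy as np

def kaplan_meier(time, event):
    """Return the distinct event times and the Kaplan-Meier survival estimate at each."""
    time, event = np.asarray(time), np.asarray(event)
    event_times = np.unique(time[event == 1])
    survival, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)                   # under observation just before t
        deaths = np.sum((time == t) & (event == 1))   # events occurring at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Toy data: patients censored at times 2 and 4.
times, surv = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print(times, surv)
```

Censored patients contribute to the at-risk count until their censoring time and then drop out without counting as events.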

2.10. Deployment of Online Application Platforms

This study presents an online prediction platform for PTL using the DeepSurv model and Python’s Flask library for survival analysis and prediction applications. The platform provides the survival rate of PTL patients up to 5 years after diagnosis.

3. Results

Experiments were conducted on hardware with the following specifications:

Processor: 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40 GHz 2.42 GHz;
Installed RAM: 16.0 GB;
Windows Specifications: Windows 10 Home Edition.

3.1. Patient Characteristics

A total of 1184 patients were included in the study, with more females (68.5%) than males (31.5%). White patients accounted for the largest proportion (84.9%). Most patients were aged between 50 and 80 years, were married, lived in large cities, and had a higher income. The most frequent histological subtype was DLBCL (56.8%), followed by marginal zone or MALT lymphoma (21.3%). Further, 36.9% of patients refused chemotherapy and 63.1% accepted chemotherapy; 44.4% of patients did not undergo surgery and 55.5% underwent surgery; and 56.8% of patients refused radiotherapy and 43.2% received radiotherapy. The clinical characteristics of the patients are shown in Table 1.

3.2. Feature Selection

The features obtained from the three feature selection methods are shown in Table 2.

Because the feature selection methods rest on different principles, they screen out different variables. To ensure that the selected features are robust and highly discriminative, we retained only those identified by at least two different methods. Finally, 12 features were selected for subsequent analysis: age, sex, histological subtype, radiotherapy, chemotherapy, ‘B’ symptoms, marital status, tumor metastasis, AJCC stage, surgery, degree of tumor invasion, and number of malignancies.

3.3. Comparison of Sampling Techniques and Models

The three sampling methods were each applied to balance the data. The machine learning algorithms were then trained on the resampled dataset and validated on the test set. The results are as follows.

3.3.1. SMOTE Oversampling

In terms of precision, Random Forest outperforms the other models with a value of 0.766, followed by XGBoost with 0.761. In terms of accuracy, recall, and F1, XGBoost outperforms the other models with values of 0.762, 0.769, and 0.763, respectively, followed by Random Forest with values of 0.755, 0.755, and 0.754. Combining the four evaluation metrics, the XGBoost model performs best under SMOTE oversampling, with better generalization ability. The confusion matrix for the XGBoost model and the corresponding results are shown in Figure 3 and Table 3.

3.3.2. ClusterCentroids Downsampling

In terms of accuracy, XGBoost outperforms the other models with a value of 0.728, followed by GBDT with 0.716. In terms of recall, GBDT outperforms the other models with a value of 0.724, followed by XGBoost with 0.721. In terms of precision, XGBoost outperforms the other models with a value of 0.727, followed by GBDT with 0.714. In terms of F1, Random Forest outperforms the other models with a value of 0.715, followed by GBDT with 0.714. Combining the four evaluation metrics, the XGBoost model performs best under ClusterCentroids downsampling, with better generalization ability. The confusion matrix for the XGBoost model and the corresponding results are shown in Figure 4 and Table 4.

3.3.3. SMOTE Tomek Link Combinatorial Sampling

In terms of precision, GBDT outperforms the other models with a value of 0.817, followed by XGBoost with 0.811. In terms of accuracy, recall, and F1, XGBoost outperforms the other models with values of 0.819, 0.816, and 0.816, respectively, followed by GBDT with values of 0.806, 0.801, and 0.806. Combining the four evaluation metrics, the XGBoost model performs best under SMOTE Tomek link combinatorial sampling, with better generalization ability. The confusion matrix for the XGBoost model and the corresponding results are shown in Figure 5 and Table 5.

In summary, the XGBoost model performed best under all three methods: SMOTE oversampling, ClusterCentroids downsampling, and SMOTE Tomek link combinatorial sampling. SVM and Logistic Regression performed better under ClusterCentroids than under the other two sampling methods, whereas Random Forest, GBDT, and XGBoost performed better under SMOTE Tomek link combinatorial sampling than under the other two. Therefore, mortality risk prediction was performed using XGBoost, and survival analysis was performed using Cox-PH and DeepSurv, in both cases on data processed with SMOTE Tomek link combinatorial sampling.

3.4. Mortality Risk Prediction Based on SHAP Interpretation Machine Learning

The XGBoost algorithm was employed to predict the risk of death, with the objective of identifying the most significant predictors, as shown in Figure 6.

3.4.1. Interpreting the Model Globally

(1) Explanation of the Importance of Features.

The XGBoost model can be interpreted in the SHAP framework by plotting the positive and negative effects of each feature, as shown in Figure 7. The vertical axis lists the features, with each row corresponding to one feature variable. The horizontal axis represents the SHAP value, which indicates the effect on the model output. Each point represents a sample, and its color reflects the magnitude of the feature value for that sample: points with larger feature values are redder, while points with smaller feature values are bluer. Additionally, the thickness of the “honeycomb” at a given position is determined by the number of points with the same SHAP value. The more points share a SHAP value, the thicker the band appears.

As illustrated in the figure, age is the most significant predictor of mortality risk in patients with primary thyroid lymphoma, followed by the presence of “B” symptoms, histological type, and marital status. Furthermore, the figure indicates that age exerts a positive effect on the risk of death, suggesting that the older the patient, the higher the risk of death. In contrast, tumor metastasis and surgery exert a negative effect on the risk of death, indicating that these indicators play an inverse role in the risk of death of the patient.

(2) The Impact of Two-By-Two Interaction Characteristics on Prediction Outcomes.

The diagonal portion of the interaction plot (Figure 8) illustrates the relationship between the features and the predicted values. The off-diagonal portion, in contrast, depicts the effect of pairwise combinations of features on the predicted values. The horizontal axis of each subplot represents the SHAP value, indicating how strongly a given combination of features affects the results. The wider the spread in a subplot, the more pronounced the effect.

As illustrated in the figure, apart from the diagonal, the combination of age and “B” symptoms had the greatest impact on the predicted outcomes.

3.4.2. Interpreting the Model Locally

(1) Decision Path.

In the SHAP decision plot (Figure 9), the gray vertical line in the middle represents the base value of the model, while the colored lines represent the effect of each feature on the predicted result. These effects move the output above or below the average prediction. The prediction lines, which start at the base value, illustrate the accumulation of SHAP values, culminating in the final model score.

The decision plot was drawn for the first 100 patients, all of whom start from a base value of 0.6225. As the value of each characteristic changes, the SHAP value is affected, ultimately influencing the patient’s predicted risk of death. A red line indicates that the patient’s risk of death is higher than the mean value due to the combined effect of the characteristics, whereas a blue line indicates the opposite.

(2) The Interpretation of Individual Cases.

A further understanding of the direction and magnitude of the influence of individual characteristics on the risk of death in patients with primary thyroid lymphoma can be achieved by analyzing individual patients using SHAP, as shown in Figure 10.

The results shown in the graph are on the log-odds scale of the predicted probability, i.e., f(x) = ln(p/(1 − p)). The graph illustrates how each variable contributes to the predicted outcome. Red represents a positive contribution to the predicted outcome, while blue represents a negative contribution. The size of the colored area reflects the magnitude of the impact. The probability of survival for the 30th patient is 0.5796, which corresponds to the displayed result of 0.321. For this patient, the contribution of age (Age = 53) is the largest, followed by number of malignancies (numberrecode = 2) and marital status (Marital = 1). The combined effect of all the characteristics resulted in a model output for this patient below the base value of 0.6225, indicating that the risk of death for this patient was lower than the baseline risk.
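The relationship between the displayed value and the probability can be checked numerically by inverting the log-odds formula. In the sketch below, the base value 0.6225 and the displayed value 0.321 come from the text. The individual feature contributions are hypothetical placeholders: only their sum (0.321 − 0.6225 = −0.3015) follows from the reported numbers.

```python
import math

def sigmoid(log_odds):
    """Invert f(x) = ln(p / (1 - p)) to recover the probability p."""
    return 1.0 / (1.0 + math.exp(-log_odds))

base_value = 0.6225     # base value reported in the text (log-odds scale)
patient_score = 0.321   # displayed output for the 30th patient (log-odds scale)

# SHAP values are additive: base value + sum of contributions = model output.
# These per-feature numbers are hypothetical; only their total is implied.
contributions = {"Age": -0.20, "numberrecode": -0.07, "Marital": -0.0315}
reconstructed = base_value + sum(contributions.values())

p_patient = sigmoid(patient_score)
p_base = sigmoid(base_value)
print(round(p_patient, 4), round(p_base, 4))
```

Converting 0.321 from log-odds gives approximately 0.5796, matching the probability quoted in the text, while the base value 0.6225 corresponds to a probability of roughly 0.65.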

As shown in Figure 11, the vertical axis lists the features and their values, and the horizontal axis represents the SHAP value. E[f(x)] is the expectation over all samples, and f(x) denotes the prediction for the 30th patient. Red features increase the patient’s predicted risk of death, while blue features reduce it. The width of each bar indicates the degree of its influence on the outcome: the wider the bar, the greater the influence on the final prediction. The three most influential features for the 30th patient were age, number of malignancies, and marital status.

In conclusion, SHAP offers an efficacious interpretive analysis of XGBoost when performing mortality risk prediction in patients with primary thyroid lymphoma. In addition to ranking the overall feature importance, the tool revealed the direction of influence of each variable in the model on the prediction results. Furthermore, it enabled local analysis of individual samples. This enables doctors to gain insights into each patient’s condition, which in turn assists them in developing personalized treatment plans and targeting therapeutic measures, thereby reducing the risk of patient mortality.

3.5. Performance of the Cox-PH Model

Age, tumor metastasis, AJCC stage, radiotherapy, and the number of malignancies had a strong prognostic impact (HR > 1), with age and radiotherapy being independent prognostic factors (p < 0.05). Figure 12 illustrates the final results, where * indicates p < 0.01 (highly significant) and *** indicates p < 0.001 (highly significant).

The c-index of OS in the Cox-PH model was 0.751 in the training set and 0.739 in the validation set, and the c-index of CSS was 0.790 in the training set and 0.779 in the validation set, as shown in Table 6.

The calibration curves of the Cox-PH model for OS and CSS deviated slightly from the reference curves, as shown in Figure 13. The diagonal line represents perfect calibration, in which the probability predicted by the model exactly matches the observed frequency of events.

The AUC of the OS training set was 0.805, 0.768, and 0.789, and that of the validation set was 0.802, 0.760, and 0.780. The AUC of the CSS training set was 0.838, 0.815, and 0.836, and that of the validation set was 0.836, 0.798, and 0.815, as shown in Figure 14. The diagonal line represents a classifier with no discriminative ability, whose predictions are equivalent to random guessing. A model whose ROC curve lies close to the diagonal cannot effectively discriminate between the positive and negative classes.

As shown in Figure 15, the OS model can help patients achieve a net benefit of 18% to 35% in both the training set and the validation set. The CSS model can help patients achieve a net benefit of 13% to 17% in the training set and 12% to 17% in the validation set.

3.6. Performance of the DeepSurv Model

For the DeepSurv model, the c-index of OS in the training set and validation set was 0.881 and 0.758, respectively, and the c-index of CSS in the training set and validation set was 0.946 and 0.789, respectively, as shown in Table 7.

The DeepSurv model had better calibration curves for OS and CSS than the Cox-PH model, as shown in Figure 16.

The AUC of the OS training set was 0.900, 0.906, and 0.932, and that of the validation set was 0.756, 0.768, and 0.780. The AUC of the CSS training set was 0.964, 0.959, and 0.953, and that of the validation set was 0.750, 0.745, and 0.770, as shown in Figure 17.

As shown in Figure 18, the OS model can help patients achieve a net benefit of about 20% in both the training set and the validation set, and the CSS model can help patients achieve a net benefit of about 5% in both sets. The green horizontal line in the graph represents the “no intervention for all patients” strategy: when no diagnostic or therapeutic measures are taken, the net benefit is always zero. The blue curve shows the net benefit of using DeepSurv to guide the treatment strategy, which declines progressively as the threshold probability rises. The orange curves represent other strategies, for which the net benefit declines more sharply as the threshold probability increases.
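For readers unfamiliar with decision curve analysis, the net benefit at a given threshold probability is commonly computed as the true-positive rate minus the false-positive rate weighted by the odds of the threshold. The sketch below uses that standard formula with made-up labels and predictions. It is not the ggDCA routine used in the study, and the function name is hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)

# Made-up outcomes and predicted risks, for illustration only.
y_true = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.7, 0.6, 0.2, 0.4, 0.1, 0.3, 0.05, 0.8, 0.55]

nb_model = net_benefit(y_true, y_prob, threshold=0.5)       # model-guided strategy
nb_all = net_benefit(y_true, [1.0] * len(y_true), 0.5)      # intervene on everyone
nb_none = 0.0                                               # intervene on no one
print(nb_model, nb_all, nb_none)
```

Plotting the net benefit over a range of thresholds for each strategy yields the decision curves, and a model is clinically useful at thresholds where its curve lies above both reference strategies.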

3.7. K–M Survival Analysis Based on the Cox-PH Model

K–M survival analysis was performed for patients stratified by sex, age, race, chemotherapy, distant metastasis, order of surgery and radiotherapy, degree of tumor invasion, number of malignant tumors, marital status, “B” symptoms, and radiotherapy. Patients who did not exhibit “B” symptoms had a higher CSS survival rate. Black patients had significantly lower CSS survival than patients of other races. Female patients had a better CSS survival rate than male patients. Age, chemotherapy, race, sex, degree of tumor invasion, and distant metastasis were significantly associated with CSS. The results are shown in Figure 19. A black dashed line at a survival probability of 0.50 is drawn so that the survival curves of the two groups can be compared visually against this value.

3.8. Web-Based Online Application Platform for Predicting Survival in PTL Patients

DeepSurv demonstrated good reliability and accuracy in predicting the survival of PTL patients. To help clinicians develop more effective treatment plans, shorten the time of clinical treatment, and obtain the survival rate of PTL patients over a given future period more intuitively, this study developed an online application platform for predicting the survival of PTL patients (http://47.120.75.127:5000/). (The website was built and accessed on 8 April 2024 successfully.) The platform is used as follows: first, select the relevant parameter variables and enter the patient data; second, click ‘calculate’; finally, obtain the survival prediction results. The platform returns the survival rate of PTL patients at the 1-year, 3-year, and 5-year time points and the survival curve over the next 5 years, as shown in Figure 20. A 5-year horizon was chosen because the median survival of PTL patients is relatively long and the 5-year survival rate is relatively high.

4. Discussion

PTL is a rare disease, and few studies have addressed it. Therefore, risk stratification and individualized treatment of PTL play a crucial role in improving survival.

ML is increasingly used in medicine and health care, and recent studies have used Cox-PH analysis to construct various prognostic models of PTL [4,21,23,24,25]. Yi J et al. used Cox regression analysis to evaluate prognostic factors and to find the best treatment plan [4]. The study concluded that surgical treatment alone did not affect the prognosis of patients with primary thyroid DLBCL, whereas patients who received a combination of chemotherapy and radiotherapy had a better prognosis. Jin S et al. established and verified a nomogram for B-cell primary thyroid malignant lymphoma (BC-PTML), which had higher discrimination power and clinical benefit than the traditional Ann Arbor staging system [23]. Jin S et al. also fitted the Cox-PH model to predict the CSS of PTL, and the c-index of the validation set reached 0.762, indicating high accuracy. At present, Cox regression analysis is a popular prognostic prediction method, but because it fits a linear survival model, more flexible methods such as DL are needed to capture nonlinear effects.

Katzman et al. showed that the DeepSurv model performs as well as or better than other survival models on simulated and real survival data [26]. Cheng D et al. found that the DeepSurv model was more accurate and flexible than the traditional Cox regression analysis model in predicting the survival probability and prognosis of patients with long osteosarcoma [27], with a higher c-index (0.800 vs. 0.774). These results support the feasibility of applying the DeepSurv model to fit prognostic models. Therefore, the present study used the DeepSurv model to predict CSS and OS in PTL patients and compared its performance with that of the Cox-PH model.
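For readers unfamiliar with how DeepSurv relates to Cox-PH, the following sketch computes the negative log partial likelihood that both models minimize; in Cox-PH the log-risk h(x) is a linear function of the covariates, whereas in DeepSurv it is the output of a neural network. This is a minimal plain-Python illustration under stated assumptions: the regularization term used in the original DeepSurv formulation is omitted, tied event times are handled by simply including all patients with time ≥ t in the risk set, and the toy data and function name are hypothetical.

```python
import math


def neg_log_partial_likelihood(log_risk, times, events):
    """Average negative log Cox partial likelihood over observed events.

    log_risk: predicted log-risk h(x_i) for each patient.
    times: follow-up times; events: 1 = death observed, 0 = censored.
    """
    n_events = sum(events)
    if n_events == 0:
        raise ValueError("at least one observed event is required")
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # censored patients only contribute through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_set_sum = sum(math.exp(h) for h, t in zip(log_risk, times) if t >= t_i)
        total += log_risk[i] - math.log(risk_set_sum)
    return -total / n_events


# Hypothetical toy data: three patients, all with observed events.
times = [1.0, 2.0, 3.0]
events = [1, 1, 1]

loss_good = neg_log_partial_likelihood([2.0, 1.0, 0.0], times, events)  # earlier death = higher risk
loss_bad = neg_log_partial_likelihood([0.0, 1.0, 2.0], times, events)   # ranking reversed
print(loss_good, loss_bad)
```

A risk ranking that agrees with the observed order of deaths gives a lower loss, which is the signal a network trained on this objective learns from.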

The dataset underwent feature screening using univariate and multivariate Cox regression analysis, recursive feature elimination, and Boruta feature selection. This process yielded a total of 12 important features. To address class imbalance, the data were processed using SMOTE oversampling, ClusterCentroids downsampling, and SMOTE Tomek link combinatorial sampling. Multiple machine learning models were then compared on multiple evaluation indexes, and the XGBoost model based on combinatorial sampling performed best. In light of these findings, the SHAP framework was used to investigate the interpretability of the XGBoost model from both a global and a local perspective. The results indicated that age, “B” symptoms, histological type, and marital status were significant factors influencing patients’ risk of death. By leveraging the XGBoost model and the SHAP framework, medical practitioners can proactively identify patients at elevated mortality risk. This capability enables timely preventative interventions, which can in turn reduce both consultation times and patient mortality risk.
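The SHAP framework attributes a prediction to individual features through Shapley values. The sketch below computes exact Shapley values for a small hypothetical scoring function by enumerating all feature subsets. It is an illustration of the underlying idea only: the SHAP library uses much faster tree-specific algorithms for models such as XGBoost, and replacing an absent feature by a single baseline value is a simplifying assumption. The toy model, its coefficients, and the feature meanings are invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    A feature outside the coalition is set to its baseline value (assumption).
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Weight of coalitions of size k that do not contain feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(z):
    """Hypothetical risk score: z = [scaled age, B symptoms (0/1), married (0/1)]."""
    return 0.5 * z[0] + 0.3 * z[1] - 0.2 * z[2] + 0.1 * z[0] * z[1]


x = [1.0, 1.0, 0.0]         # hypothetical patient
baseline = [0.0, 0.0, 1.0]  # hypothetical reference patient

phi = shapley_values(toy_risk, x, baseline)
print(phi)
```

The values sum to the difference between the patient’s score and the reference score, which is the additivity property that the local SHAP plots rely on.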

Based on the c-index, the study concluded that the DeepSurv model was superior to the Cox-PH model in predicting survival in PTL patients. In terms of the calibration curve, the mortality risk estimated by DeepSurv was closer to the observed outcomes than that estimated by Cox-PH. In terms of the ROC curve, the AUC value of the DeepSurv model was significantly higher than that of the Cox-PH model, demonstrating that the specificity and sensitivity of the DeepSurv model were better. In terms of the DCA curves, the net clinical benefit of the two models was similar.
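Because the comparison above rests on the c-index, the following minimal sketch shows how Harrell’s c-index can be computed from follow-up times, event indicators, and predicted risk scores. It is a plain-Python illustration with hypothetical toy data; pairs with tied follow-up times are skipped, which is one common convention and an assumption here.

```python
def harrell_c_index(times, events, risk):
    """Fraction of comparable patient pairs whose risk ordering is concordant.

    A pair (i, j) is comparable when patient i has an observed event
    strictly before patient j's follow-up time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: the third patient is censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]

c_perfect = harrell_c_index(times, events, [4.0, 3.0, 2.0, 1.0])
c_reversed = harrell_c_index(times, events, [1.0, 2.0, 3.0, 4.0])
c_constant = harrell_c_index(times, events, [1.0, 1.0, 1.0, 1.0])
print(c_perfect, c_reversed, c_constant)
```

A value of 0.5 corresponds to uninformative predictions and 1.0 to a perfect ranking, which gives a scale for reading the c-index values in Tables 6 and 7.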

Yin et al. [22] found that the DeepSurv model was superior to the traditional Cox-PH model in predicting survival in patients with malignant small bowel tumors. Huang et al. [28] compared six survival prediction models, including Random Survival Forest and XGBoost Survival Embeddings, and found that DeepSurv was the most accurate in predicting the prognosis and survival time of patients with ampullary adenocarcinoma. The findings of these studies are consistent with the findings of this paper.

In conclusion, DeepSurv has good predictive performance. In addition, DeepSurv can automatically extract features and process complex data, which gives it an advantage over Cox-PH. Therefore, this study presents a web-based prediction platform (http://47.120.75.127:5000/; the website was built and successfully accessed on 8 April 2024). After clinical staff enter a patient’s data on the webpage, the platform displays the predicted survival rate at 1, 3, and 5 years, thus enabling risk stratification of the patient. Clinicians can also select different surgical methods, radiation methods, and other options on the webpage to calculate the survival rate under different scenarios and thereby choose a more suitable treatment plan. The platform is expected to provide clinicians with a useful tool for patient counseling, for informing treatment strategies, and for optimizing prognosis based on patient survival and risk stratification.

This study found that chemotherapy or radiotherapy alone could not improve the prognosis of patients. Previous research has demonstrated that a combination of radiotherapy, chemotherapy, and surgery can positively impact the outcomes of patients with PTL [25]. Further research is required to investigate more precise and effective treatment plans for various subgroups of PTL patients.

There are some limitations to this study. Firstly, it is a retrospective study based on the SEER database and lacks other populations for external validation. Obtaining data from a different geographic area would have been beneficial as it would have allowed for local factors to be taken into account, such as environmental conditions and access to healthcare. Secondly, as the clinical treatment data of the patients were not included in the SEER database, further conclusions regarding treatment could not be drawn, and we expect that more datasets will be accessible in the future, which will allow further progress in this study. In addition, primary thyroid lymphoma is a rare cancer and the small amount of data may not be sufficient to ensure a high level of generalizability of the model, which could potentially reduce the robustness of the model.

This study did not use optimization algorithms to tune the machine learning models. Several biologically inspired algorithms have been developed in recent years, such as the Auricular Fox optimization algorithm, but their integration with machine learning remains limited. Combining such an optimization algorithm with the XGBoost model might further improve prediction accuracy.

5. Conclusions

This study constructed and validated a machine learning model for predicting the risk of death in patients with primary thyroid lymphoma and a deep learning model for predicting survival probability. The XGBoost model based on combinatorial sampling had the greatest predictive capacity for the risk of death. Furthermore, the SHAP framework was introduced for interpretable analysis, which enabled the identification of the main risk factors affecting death in patients with primary thyroid lymphoma. This, in turn, can assist doctors in making informed clinical decisions and help reduce the risk of patient mortality.

Meanwhile, this study demonstrated that the DeepSurv model was superior to the traditional Cox-PH model in predicting survival. Furthermore, the DeepSurv model has the potential to be a viable and promising tool for clinical prediction. The predictive capabilities of DeepSurv will pave the way for future research in deep neural networks and survival analysis, and DeepSurv has the potential to help physicians research personalized treatment options. Therefore, this study integrates the DeepSurv model into a web-based clinical application platform that provides survival rates for PTL patients over the next 5 years. This emphasizes the practical applicability of the findings.

Although the results are encouraging, our study has several limitations. First, the dataset was retrospective and lacked data from different geographic regions for validation. Second, clinical data were difficult to obtain and may have missed treatment-related factors. In addition, the low prevalence of primary thyroid lymphoma and the small dataset may reduce the robustness of the model. We look forward to having more publicly available datasets accessible in the future.

Author Contributions

Conceptualization, Z.Y.; Data curation, Z.Y.; Formal analysis, Z.Y.; Funding acquisition, J.C.; Investigation, Z.Y.; Methodology, Z.Y.; Project administration, R.H.; Software, Z.Y.; Supervision, J.C.; Validation, R.H.; Writing—original draft, Z.Y.; Writing—review and editing, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received financial support from the National Natural Science Foundation (grant number 81671633) and was further backed by the Open Fund of Hubei Longzhong Laboratory.

Data Availability Statement

The data presented in this study are openly available in Surveillance, Epidemiology, and End Results Program (cancer.gov).

Acknowledgments

Many thanks to the reviewers for their positive feedback, valuable comments, and constructive suggestions that helped improve the quality of this article. Many thanks for the editors’ great help and coordination for the publication of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Chai, Y.J.; Hong, J.H.; Koo, D.H.; Yu, H.W.; Lee, J.-H.; Kwon, H.; Kim, S.-J.; Choi, J.Y.; Lee, K.E. Clinicopathological characteristics and treatment outcomes of 38 cases of primary thyroid lymphoma: A multicenter study. Ann. Surg. Treat. Res. 2015, 89, 295–299.
2. Watanabe, N.; Narimatsu, H.; Noh, J.Y.; Iwaku, K.; Kunii, Y.; Suzuki, N.; Ohye, H.; Suzuki, M.; Matsumoto, M.; Yoshihara, A.; et al. Long-Term Outcomes of 107 Cases of Primary Thyroid Mucosa-Associated Lymphoid Tissue Lymphoma at a Single Medical Institution in Japan. J. Clin. Endocrinol. Metab. 2018, 103, 732–739.
3. Vardell Noble, V.; Ermann, D.A.; Griffin, E.K.; Silberstein, T. Primary Thyroid Lymphoma: An Analysis of the National Cancer Database. Cureus 2019, 11, e4088.
4. Yi, J.; Yi, P.; Wang, W.; Wang, H.; Wang, X.; Luo, H.; Fan, P. A Multicenter Retrospective Study of 58 Patients With Primary Thyroid Diffuse Large B Cell Lymphoma. Front. Endocrinol. 2020, 11, 542.
5. Xie, Y.; Liu, W.; Liu, Y.; Wang, W.; Wang, M.; Liu, H.; Li, X.; Gao, W. Diagnosis and Clinical Analysis of Primary Thyroid Lymphoma. Acta Acad. Med. Sin. 2017, 39, 377–382.
6. Green, L.D.; Mack, L.; Pasieka, J.L. Anaplastic thyroid cancer and primary thyroid lymphoma: A review of these rare thyroid malignancies. J. Surg. Oncol. 2006, 94, 725–736.
7. Meyer-Rochow, G.; Sywak, M.; Reeve, T.; Delbridge, L.; Sidhu, S. Surgical trends in the management of thyroid lymphoma. Eur. J. Surg. Oncol. 2008, 34, 576–580.
8. Lai, Y.; Ding, C.; Shen, Y.; Zhao, L.; Li, H. Clinicopathological analysis of primary thyroid non-Hodgkin lymphoma: A single-center study. Transl. Cancer Res. 2023, 12, 515–524.
9. Doria, R.; Jekel, J.F.; Cooper, D.L. Thyroid lymphoma. The case for combined modality therapy. Cancer 1994, 73, 200–206.
10. Štěpánek, L.; Habarta, F.; Malá, I.; Štěpánek, L.; Nakládalová, M.; Boriková, A.; Marek, L. Machine Learning at the Service of Survival Analysis: Predictions Using Time-to-Event Decomposition and Classification Applied to a Decrease of Blood Antibodies against COVID-19. Mathematics 2023, 11, 819.
11. Lixandru-Petre, I.-O.; Dima, A.; Musat, M.; Dascalu, M.; Gradisteanu Pircalabioru, G.; Iliescu, F.S.; Iliescu, C. Machine Learning for Thyroid Cancer Detection, Presence of Metastasis, and Recurrence Predictions—A Scoping Review. Cancers 2025, 17, 1308.
12. Olatunji, S.O.; Alotaibi, S.; Almutairi, E.; Alrabae, Z.; Almajid, Y.; Altabee, R.; Altassan, M.; Ahmed, M.I.B.; Farooqui, M.; Alhiyafi, J. Early diagnosis of thyroid cancer diseases using computational intelligence techniques: A case study of a Saudi Arabian dataset. Comput. Biol. Med. 2021, 131, 104267.
13. Luong, G.; Idarraga, A.J.; Hsiao, V.; Schneider, D.F. Risk stratifying indeterminate thyroid nodules with machine learning. J. Surg. Res. 2022, 270, 214–220.
14. Książek, W. Explainable Thyroid Cancer Diagnosis Through Two-Level Machine Learning Optimization with an Improved Naked Mole-Rat Algorithm. Cancers 2024, 16, 4128.
15. Schindele, A.; Krebold, A.; Heiß, U.; Nimptsch, K.; Pfaehler, E.; Berr, C.; Bundschuh, R.A.; Wendler, T.; Kertels, O.; Tran-Gia, J.; et al. Interpretable Machine Learning for Thyroid Cancer Recurrence Prediction: Leveraging XGBoost and SHAP Analysis. Eur. J. Radiol. 2025, 186, 112049.
16. Atay, F.F.; Yagin, F.H.; Colak, C.; Elkiran, E.T.; Mansuri, N.; Ahmad, F.; Ardigò, L.P. A Hybrid Machine Learning Model Combining Association Rule Mining and Classification Algorithms to Predict Differentiated Thyroid Cancer Recurrence. Front. Med. 2024, 11, 1461372.
17. Yang, C.Q.; Gardiner, L.; Wang, H.; Hueman, M.T.; Chen, D. Creating prognostic systems for well-differentiated thyroid cancer using machine learning. Front. Endocrinol. 2019, 10, 288.
18. Barfejani, A.H.; Rostami, M.; Rahimi, M.; Far, H.S.; Gholizadeh, S.; Behjat, M.; Tarokhian, A. Predicting overall survival in anaplastic thyroid cancer using machine learning approaches. Eur. Arch. Otorhinolaryngol. 2025, 282, 1653–1657.
19. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 4765–4774.
20. Zheng, Y.-H.; Tian, B.; Qin, W.-W.; Zhu, Q.-W.; Feng, J.; Hu, W.-Y.; Chen, R.-A.; Liu, L. Distribution and survival outcomes of primary head and neck hematolymphoid neoplasms in older people: A population-based study. Clin. Exp. Med. 2023, 23, 3957–3967.
21. Chen, E.; Wu, Q.; Jin, Y.; Jin, W.; Cai, Y.; Wang, Q.; Zhang, X.; Wang, O.; Li, Q.; Zheng, Z. Clinicopathological characteristics and prognostic factors for primary thyroid lymphoma: Report on 28 Chinese patients and results of a population-based study. Cancer Manag. Res. 2018, 10, 4411–4419.
22. Yin, M.; Lin, J.; Liu, L.; Gao, J.; Xu, W.; Yu, C.; Qu, S.; Liu, X.; Qian, L.; Xu, C.; et al. Development of a Deep Learning Model for Malignant Small Bowel Tumors Survival: A SEER-Based Study. Diagnostics 2022, 12, 1247.
23. Jin, S.; Xie, L.; You, Y.; He, C.; Li, X. Development and validation of a nomogram to predict B-cell primary thyroid malignant lymphoma-specific survival: A population-based analysis. Front. Endocrinol. 2022, 13, 965448.
24. Zhang, K.; Peng, X.; Wei, T.; Li, Z.; Zhu, J.; Chen, Y.-W. Prognostic Nomogram and Competing Risk Analysis of Death for Primary Thyroid Lymphoma: A Long-term Survival Study of 1638 Patients. Ann. Surg. 2022, 3, e22.
25. Xiang, N.; Dong, F.; Zhan, X.; Wang, S.; Wang, J.; Sun, E.; Chen, B. Incidence and prognostic factors of primary thyroid lymphoma and construction of prognostic models for post-chemotherapy and postoperative patients: A population-based study. BMC Endocr. Disord. 2021, 21, 68.
26. Katzman, J.L.; Shaham, U.; Cloninger, A.; Bates, J.; Jiang, T.; Kluger, Y. DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med. Res. Methodol. 2018, 18, 24.
27. Cheng, D.; Liu, D.; Li, X.; Mi, Z.; Zhang, Z.; Tao, W.; Dang, J.; Zhu, D.; Fu, J.; Fan, H. A deep learning model for accurately predicting cancer-specific survival in patients with primary bone sarcoma of the extremity: A population-based study. Clin. Transl. Oncol. 2024, 26, 709–719.
28. Huang, T.; Huang, L.; Yang, R.; Li, S.; He, N.; Feng, A.; Li, L.; Lyu, J. Machine Learning Models for Predicting Survival in Patients with Ampullary Adenocarcinoma. Asia-Pac. J. Oncol. Nurs. 2022, 9, 100141.

Figure 1. Flow chart of PTL patient data selection in SEER database.

Figure 2. Deep learning process diagram.

Figure 3. XGBoost confusion matrix after oversampling.

Figure 4. XGBoost confusion matrix after downsampling.

Figure 5. XGBoost confusion matrix after combinatorial sampling.

Figure 6. XGBoost feature importance.

Figure 7. Positive and negative effect plots for each variable in the SHAP framework.

Figure 8. The impact of two-by-two interaction characteristics on prediction outcomes.

Figure 9. Decision charts for the first 100 patients.

Figure 10. Interpretation of the SHAP model for the 30th patient.

Figure 11. Importance of characteristics of 30th patient.

Figure 12. Multivariate Cox regression analysis.

Figure 13. Calibration curve of Cox-PH. (A–C) are 1-, 3-, and 5-year calibration curves for the OS training set; (D–F) are 1-, 3-, and 5-year calibration curves for the OS validation set; (G–I) are 1-, 3-, and 5-year calibration curves for the CSS training set; (J–L) are 1-, 3-, and 5-year calibration curves for the CSS validation set.

Figure 14. ROC curve of Cox-PH. (A) is the ROC curve for the OS training set, (B) is the ROC curve for the OS validation set, (C) is the ROC curve for the CSS training set, and (D) is the ROC curve for the CSS validation set.

Figure 15. DCA curve of Cox-PH. (A–C) are 1-, 3-, and 5-year DCA curves for OS training set; (D–F) are 1-, 3-, and 5-year DCA curves for OS validation set; (G–I) are 1-, 3-, and 5-year DCA curves for CSS training set; and (J–L) are 1-, 3-, and 5-year DCA curves for CSS validation set.

Figure 16. Calibration curves of DeepSurv. (A–C) are 1-, 3-, and 5-year calibration curves for the OS training set. (D–F) are 1-, 3-, and 5-year calibration curves for the OS validation set; (G–I) are 1-, 3-, and 5-year calibration curves for the CSS training set; (J–L) are 1-, 3-, and 5-year calibration curves for the CSS validation set.

Figure 17. ROC curve of DeepSurv. (A–C) are ROC curves for OS. (D–F) are ROC curves for CSS.

Figure 18. DCA curve of DeepSurv. (A–C) are 1-, 3-, and 5-year DCA curves for OS training set; (D–F) are 1-, 3-, and 5-year DCA curves for OS validation set; (G–I) are 1-, 3-, and 5-year DCA curves for CSS training set; and (J–L) are 1-, 3-, and 5-year DCA curves for CSS validation set.

Figure 19. KM survival curve analysis. (A) is age, (B) is sex, (C) is race, (D) is chemotherapy, (E) is distant metastasis, (F) is sequence of surgery and radiotherapy, (G) is degree of tumor invasion, (H) is number of malignancies, (I) is marital status, (J) is “B” symptom, and (K) is radiotherapy.

Figure 20. Online web predictor.

Table 1. Patient clinical characteristics.

Characteristics	Training Set	Validation Set	p-Value
Sex (%)			0.703
Female	574 (68.99)	238 (67.61)
Male	258 (31.01)	114 (32.39)
Race (%)			0.967
American Indian	1 (0.12)	3 (0.85)
Asian	69 (8.29)	39 (11.08)
Black	24 (2.88)	7 (1.99)
White	738 (88.70)	303 (86.08)
Histologic subtypes (%)			<0.001
NOS	89 (10.70)	27 (7.67)
Composite-site Hodgkin’s lymphoma	3 (0.36)	0
Small lymphocytic B cell	6 (0.72)	4 (1.14)
Mantle cell lymphoma	3 (0.36)	3 (0.85)
Mixed diffuse B cell	2 (0.24)	1 (0.28)
DLBCL	455 (54.69)	190 (53.98)
Burkitt’s lymphoma	24 (2.88)	13 (3.69)
Follicular lymphoma	68 (8.17)	36 (10.23)
Marginal zone lymphoma	177 (21.27)	76 (21.59)
T-cell lymphoma	5 (0.60)	2 (0.57)
Pathological grade (%)			0.107
B-cell	823 (99.52)	348 (98.86)
Grade I	3 (0.36)	3 (0.85)
Grade III	1 (0.12)	1 (0.28)
T-cell	5 (0.60)	3 (0.85)
Distant metastasis surgery (%)			0.832
Other regional sites	828 (98.92)	348 (98.86)
Distant lymph node	1 (0.12)	2 (0.57)
Distant site	3 (0.36)	2 (0.57)
Radiation (%)			0.146
Beam radiation	342 (41.11)	144 (40.91)
Method or source not specified	16 (1.92)	2 (0.57)
Recommended	5 (0.60)	2 (0.57)
Refused	469 (56.37)	204 (57.95)
Order of surgery and radiotherapy (%)			0.183
No radiation	643 (77.28)	260 (73.86)
Radiation prior to surgery	188 (22.60)	92 (26.14)
Radiation after surgery	1 (0.12)	0
Chemotherapy record (%)			0.526
Yes	525 (63.10)	221 (62.78)
No	307 (36.90)	131 (37.22)
Systemic “B” symptoms at diagnosis (%)			0.386
Yes	44 (5.29)	13 (3.69)
No	726 (87.26)	314 (89.20)
Not documented	62 (7.45)	25 (7.10)
Marital status (%)			0.272
Single	100 (12.02)	42 (11.93)
Married	657 (78.97)	276 (78.41)
Divorced	75 (9.01)	34 (9.66)
Region (%)			0.013
Counties in metropolitan areas of ≥ 1 million population	454 (54.57)	182 (51.70)
Counties in metropolitan areas of 250,000 to 1 million population	184 (22.12)	67 (19.03)
Counties in metropolitan areas of < 250,000 population	64 (7.69)	34 (9.66)
Nonmetropolitan counties adjacent to a metropolitan area	67 (8.05)	38 (10.80)
Nonmetropolitan counties not adjacent to a metropolitan area	63 (7.57)	9 (2.56)
Tumor metastasis (%)			0.017
Localized	472 (56.73)	190 (53.98)
Regional	265 (31.85)	106 (30.11)
Distant	95 (11.42)	46 (13.07)
AJCC stage (%)			0.151
IE	99 (11.90)	45 (12.78)
IEA	321 (38.58)	127 (36.08)
IEB	33 (3.97)	14 (3.98)
II	1 (0.12)	2 (0.57)
IIE	67 (8.05)	24 (6.82)
IIA	10 (1.20)	5 (1.42)
IIEA	173 (20.79)	86 (24.43)
IIEB	33 (3.97)	3 (0.85)
IIIE	1 (0.12)	1 (0.28)
IIIA	3 (0.36)	2 (0.57)
IIIEA	17 (2.04)	7 (1.99)
IIIEB	2 (0.24)	1 (0.28)
IIIES	0	1 (0.28)
IIIESA	2 (0.24)	2 (0.57)
IIIESB	2 (0.24)	0
IIISA	1 (0.12)	0
IV	12 (1.44)	3 (0.85)
IVA	34 (4.09)	21 (5.97)
IVB	21 (2.52)	8 (2.27)
Surgery (%)			0.003
No	376 (45.19)	150 (42.61)
Excision	454 (54.57)	202 (57.39)
Unknown	2 (0.24)	0
Degree of tumor invasion (%)			0.077
I	453 (54.45)	186 (52.84)
II	284 (34.13)	120 (34.09)
III	28 (3.37)	14 (3.98)
IV	67 (8.05)	32 (9.09)
Number of malignant tumors (%)			0.125
1	706 (84.86)	301 (85.51)
2	107 (12.86)	47 (13.35)
3	16 (1.92)	4 (1.14)
4	3 (0.36)	0
Income (%)			0.692
< USD 35,000	14 (1.68)	5 (1.42)
USD 35,000–USD 39,999	19 (2.28)	7 (1.99)
USD 40,000–USD 44,999	34 (4.09)	11 (3.13)
USD 45,000–USD 49,999	35 (4.21)	19 (5.40)
USD 50,000–USD 54,999	59 (7.09)	34 (9.66)
USD 55,000–USD 59,999	69 (8.29)	32 (9.09)
USD 60,000–USD 64,999	77 (9.25)	37 (10.51)
USD 65,000–USD 69,999	148 (17.79)	47 (13.35)
USD 70,000–USD 74,999	77 (9.25)	26 (7.39)
USD 75,000+	300 (36.06)	134 (38.07)
Age	64.819 ± 14.779	64.94 ± 14.564	<0.001

Table 2. Important features screened by different feature selection methods.

Feature Selection Methods	Screening Features
Univariate and Multivariate Cox Regression Analysis	Age, sex, histological type, tumor metastasis, radiotherapy, AJCC staging, surgery, sequence of surgery and radiotherapy, “B” symptoms, degree of tumor invasion, marital status, number of malignancies
Recursive Feature Elimination	Age, sex, histological type, radiotherapy, chemotherapy, “B” symptoms, marital status, region, tumor metastasis, AJCC stage, surgery, degree of tumor invasion, number of malignancies, income
Boruta Feature Selection	Age, histological type, tumor metastasis, AJCC staging, radiotherapy, chemotherapy, “B” symptoms, degree of tumor invasion, number of malignancies

Table 3. SMOTE oversampled modeling results.

	Accuracy	Recall	Precision	F1
SVM	0.532	0.532	0.544	0.486
Logistic Regression	0.691	0.691	0.693	0.691
Random Forest	0.755	0.755	0.766	0.754
GBDT	0.722	0.722	0.722	0.722
XGBoost	0.762	0.769	0.761	0.763

Table 4. ClusterCentroids downsampling modeling results.

	Accuracy	Recall	Precision	F1
SVM	0.593	0.593	0.596	0.590
Logistic Regression	0.701	0.702	0.701	0.706
Random Forest	0.703	0.713	0.707	0.715
GBDT	0.716	0.724	0.714	0.714
XGBoost	0.728	0.721	0.727	0.711

Table 5. SMOTE Tomek link combinatorial sampling modeling results.

	Accuracy	Recall	Precision	F1
SVM	0.563	0.563	0.598	0.529
Logistic Regression	0.675	0.675	0.678	0.675
Random Forest	0.801	0.801	0.801	0.801
GBDT	0.806	0.801	0.817	0.806
XGBoost	0.819	0.816	0.811	0.816

Table 6. The c-index of the Cox-PH model.

	Train	Test
OS	0.751	0.739
CSS	0.790	0.779

Table 7. The c-index of the DeepSurv model.

	Train	Test
OS	0.881	0.758
CSS	0.946	0.789

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
