Article

Explaining Risk Stratification in Differentiated Thyroid Cancer Using SHAP and Machine Learning Approaches

by Mallika Khwanmuang 1, Watcharaporn Cholamjiak 2 and Pasa Sukson 1,*
1 School of Medicine, University of Phayao, Phayao 56000, Thailand
2 School of Science, University of Phayao, Phayao 56000, Thailand
* Author to whom correspondence should be addressed.
Biomedicines 2025, 13(12), 2964; https://doi.org/10.3390/biomedicines13122964
Submission received: 24 October 2025 / Revised: 20 November 2025 / Accepted: 25 November 2025 / Published: 2 December 2025
(This article belongs to the Special Issue Pathological Biomarkers in Precision Medicine)

Abstract

Background/Objectives: Differentiated thyroid cancer (DTC) represents over 90% of all thyroid malignancies and typically has a favorable prognosis. However, approximately 30% of patients experience recurrence within 10 years after initial treatment. Conventional risk classification frameworks such as the American Thyroid Association (ATA) and AJCC TNM systems rely heavily on pathological interpretation, which may introduce observer variability and incomplete documentation. This study aimed to develop an interpretable machine-learning framework for risk stratification in DTC and to identify major clinical predictors using SHapley Additive exPlanations (SHAP). Methods: A retrospective dataset of 345 patients was obtained from the UCI Machine Learning Repository. Thirteen clinicopathological features were analyzed, including Age, Gender, T, N, M, Hx Radiotherapy, Focality, Adenopathy, Pathology, and Response. Statistical analysis and feature selection (ReliefF and mRMR) were applied to identify the most influential variables. Two modeling scenarios were tested using an optimizable neural network classifier: (1) all 10 core features and (2) reduced features selected from machine learning criteria. SHAP analysis was used to explain model predictions and determine feature impact for each risk category. Results: Reducing the input features from 10 to 6 led to improved performance in the explainable neural network model (AUC = 0.94, accuracy = 92%), confirming that T, N, Response, Age, M, and Hx Radiotherapy were the most informative predictors. SHAP analysis highlighted N and T as the dominant drivers of high-risk classification, while Response enhanced postoperative biological interpretation. Notably, when Response was excluded (Scenario III), the optimizable tree model still achieved strong predictive performance (AUC = 0.93–0.96), demonstrating that accurate preoperative risk estimation can be achieved using only clinical baseline features.
Conclusions: The proposed interpretable neural network model effectively stratifies recurrence risk in DTC while reducing dependence on subjective pathological interpretation. SHAP-based feature attribution enhances clinical transparency, supporting integration of explainable machine learning into thyroid cancer follow-up and personalized management.

1. Introduction

Differentiated thyroid cancer (DTC)—comprising papillary (PTC) and follicular (FTC) carcinoma—accounts for the vast majority of thyroid malignancies worldwide [1,2,3]. Although overall disease-specific survival is excellent, a clinically meaningful subset of patients develops persistent or recurrent disease during follow-up, with reported 10-year recurrence proportions ranging from ~15% to nearly 30% depending on cohort composition, disease stage, and surveillance intensity [1,4,5,6]. PTC is the predominant subtype; FTC represents ~10–15% of cases and is distinguished by a higher propensity for hematogenous spread due to vascular invasion, which underpins its metastatic behavior to lung and bone [2,7,8].
Thyroid biology provides additional context for risk behavior. The gland synthesizes thyroxine (T4) and triiodothyronine (T3), under control of thyroid-stimulating hormone (TSH). While TSH stimulation has long been implicated in thyroid tumorigenesis, emerging data suggest that higher circulating free T4 (FT4), altered FT4/FT3 ratios, and hyperthyroxinemia may also associate with malignant transformation and adverse phenotypes in DTC [9,10,11]. These endocrine signals intersect with the established clinicopathologic determinants of recurrence—tumor size and extrathyroidal extension, lymph-node metastasis, distant spread, histologic subtype, vascular/lymphovascular invasion, and multifocality [5,7,12,13,14,15]. Therapeutic strategies differ markedly between well-differentiated and dedifferentiated or otherwise aggressive thyroid cancer subtypes. While management of DTC typically relies on surgery followed by radioiodine therapy, dedifferentiated tumors—characterized by loss of iodine-avid function—often require systemic molecular-targeted agents, particularly tyrosine kinase inhibitors. Recognizing these distinctions highlights the clinical importance of accurately identifying key prognostic predictors in DTC [16]. Furthermore, because a subset of patients with DTC develop disease recurrence, early recognition of recurrence-associated risk factors remains essential. Recent evidence [17] indicates that intraglandular dissemination, tumor size, bilateral cervical lymph node involvement, and coexisting Hashimoto’s thyroiditis represent major risk determinants, underscoring the need for more refined and individualized risk-stratification approaches.
Two complementary frameworks anchor contemporary practice. The AJCC TNM system stratifies anatomic extent to inform prognosis and survival counseling [18], while the American Thyroid Association (ATA) Risk Stratification System integrates postoperative clinicopathologic variables to classify recurrence risk as low, intermediate, or high and to guide decisions regarding radioactive iodine (RAI), thyrotropin suppression, and follow-up intensity [5,19]. Dynamic risk stratification refines these categories over time by incorporating treatment response (biochemical and structural) during surveillance [20,21].
Despite their clinical utility, TNM and ATA risk frameworks leave important needs unmet:
  • False-negative risk assignment. Even patients initially categorized as ATA low risk can recur during long-term follow-up; intermediate-risk misclassification remains non-trivial in several series, reflecting underestimation of true recurrence probability [4,6,20].
  • Observer variability. Key histologic features—especially vascular/lymphovascular invasion and minimal extrathyroidal extension—show modest interobserver agreement, with discordance reported in multi-institutional studies, contributing to inconsistent risk assignment [13,14,15,22].
  • Incomplete documentation. Missing or inconsistent ATA reporting, variable quantification of nodal burden, and heterogeneous capture of extrathyroidal extension complicate cross-institutional comparisons and registry-based research [3,5].
  • Static, siloed data. Classical schemes rely on static postoperative pathology and rarely integrate longitudinal labs, imaging trajectories, and narrative reports in a unified, quantitative manner [20,23].
  • Limited interpretability of newer tools. While AI models can outperform traditional schemes on discrimination metrics, their “black-box” nature and lack of transparent, case-level explanations hinder clinical adoption [24,25,26].
Recent studies have applied machine learning (ML)—including support vector machines, random forests, gradient boosting, decision trees, artificial neural networks (ANNs), and CatBoost—to improve recurrence prediction and dynamic risk reassessment, achieving high sensitivities and AUROC values in single-center cohorts [24,27,28,29,30,31,32]. Typical predictors include age and sex; TNM components; multifocality; intraglandular dissemination; Hashimoto’s thyroiditis; smoking and prior radiotherapy; and early treatment response [24,28,29,30,31]. However, dataset heterogeneity, site-specific practice patterns, and limited generalizability remain concerns; crucially, clinicians require transparent explanations that map model outputs back to familiar risk factors.
SHAP (SHapley Additive exPlanations) offers a principled, model-agnostic solution by decomposing a prediction into additive contributions from each feature based on cooperative game theory [33,34]. SHAP can (i) rank global drivers of risk and (ii) provide patient-level attributions that reveal how, for example, nodal status, tumor extent, and treatment response jointly alter predicted risk. This is especially relevant because risk classification itself is not directly measurable: it depends on expert synthesis of multiple variables—capsular and vascular invasion, extrathyroidal extension, multifocality, and histologic subtype—whose accurate assessment often requires highly experienced pathologists. Borderline cases (e.g., minimal extrathyroidal extension or focal vascular invasion) are susceptible to inter-observer variability, introducing latent label noise that may degrade model calibration if unaddressed [13,14,15,22]. By learning reproducible multi-feature patterns and pairing them with SHAP explanations, ML models can complement clinician judgment and mitigate subjective variability in risk assessment.
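For reference, the additive decomposition described above is usually written as follows. This is the standard Shapley formulation, not notation taken from this study: F (the set of input features), v(S) (the expected model output when only the features in S are known), and phi_0 (the baseline output) are symbols introduced here for illustration.

```latex
\begin{aligned}
\phi_i(x) &= \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}\,
  \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr] \\
f(x) &= \phi_0 + \sum_{i \in F} \phi_i(x)
\end{aligned}
```

The first line averages the marginal effect of adding feature i over every possible subset of the other features. The second line is the additivity property: the attributions for one patient sum to the gap between that patient's prediction and the baseline.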
Building on these gaps, we develop and explain ML models for risk stratification in differentiated thyroid cancer using a curated set of clinicopathologic features (including treatment response when appropriate) and quantify feature contributions with SHAP. We (i) compare feature-selection strategies, (ii) evaluate multiple classifiers against standard metrics, and (iii) use SHAP to align model behavior with clinical expectations around TNM burden, response, and vascular invasion—thereby delivering an interpretable, clinically consonant framework for recurrence risk assessment.

2. Materials and Methods

2.1. Dataset Description

The dataset employed in this study is the Differentiated Thyroid Cancer Recurrence dataset obtained from the UCI Machine Learning Repository [35]. It comprises 383 patients with well-differentiated thyroid carcinoma who were followed for at least 10 years across a 15-year study period. From the initial dataset containing 16 clinical and pathological features, a process of Clinical Review and Feature Exclusion was conducted in collaboration with medical experts. Based on clinical redundancy and direct derivation relationships (e.g., Stage derived from T, N, M and Recurred dependent on Risk), only 13 features were retained for model development and analysis. A summary of the dataset characteristics is provided in Table 1.
Although the dataset includes both TNM staging and risk stratification variables, the Stage variable was excluded from analysis because it is a deterministic composite derived directly from T, N, and M. Including both would introduce redundancy and multicollinearity without additional predictive value. Similarly, the Recurred variable—although it represents the primary outcome in the original dataset—was not used as the prediction target in this study. This decision was made because Risk stratification inherently encapsulates recurrence probability, being the clinical construct used by oncologists and pathologists to anticipate recurrence before it occurs. In other words, recurrence is a manifestation, whereas the Risk score is an assessment integrating pathological findings and clinical interpretation. Therefore, our machine-learning models were trained to predict and explain the Risk category, which is the clinically meaningful decision layer in thyroid cancer management. From a translational perspective, accurate prediction and explanation of Risk contribute more directly to patient stratification and personalized therapy planning than mere recurrence classification.

2.2. Methods—Data Encoding

All categorical variables were encoded as numeric values prior to modeling (see Tables 2–4). Spelling inconsistencies in the raw file (e.g., “Hx Radiothreapy”) were harmonized to Hx Radiotherapy for reporting, while preserving the original column name in the data dictionary for traceability. We predicted Risk (Low/Intermediate/High) as the primary target. The mapping was fixed a priori and applied consistently across all splits to prevent data leakage. Values not present in the training set were rejected at preprocessing time to avoid silent mis-mapping.
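The encoding rule described above (a fixed mapping, with unknown values rejected rather than silently mapped) can be sketched in Python as follows. This is an illustration, not the study's preprocessing code: the Risk codes follow the class numbering used in the SHAP section (0 = Low, 1 = Intermediate, 2 = High), while the N-stage codes and the function name `encode_column` are assumptions made for the example.

```python
# Column rename reported in the text: the raw file misspells this header.
RAW_TO_REPORT_NAME = {"Hx Radiothreapy": "Hx Radiotherapy"}

# Fixed, a priori mappings. Risk codes mirror the class indices used later
# in the SHAP analysis; the N-stage codes are hypothetical placeholders.
RISK_CODES = {"Low": 0, "Intermediate": 1, "High": 2}
N_CODES = {"N0": 0, "N1a": 1, "N1b": 2}


def encode_column(values, mapping, column_name):
    """Map categorical labels to integer codes, failing loudly on unknown labels."""
    encoded = []
    for row_index, value in enumerate(values):
        if value not in mapping:
            # Reject instead of guessing, so a typo cannot become a valid code.
            raise ValueError(
                f"Unmapped value {value!r} in column {column_name!r} at row {row_index}"
            )
        encoded.append(mapping[value])
    return encoded


if __name__ == "__main__":
    print(encode_column(["Low", "High", "Intermediate"], RISK_CODES, "Risk"))
    print(encode_column(["N0", "N1b"], N_CODES, "N"))
```

Because the dictionaries are defined once and never refitted, the same label always receives the same code in every training, validation, and test split.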

2.3. Research Workflow

The research aimed to develop an interpretable machine-learning framework for risk stratification in differentiated thyroid cancer (DTC) using clinical, pathological, and treatment response data. The process consisted of five main stages, integrating statistical analysis, machine learning, and SHAP-based interpretability to bridge data-driven prediction with clinical reasoning.
Figure 1 shows that the process begins with data acquisition from the UCI public dataset (17 features, 383 patients), followed by Clinical Review and Feature Exclusion to retain 13 clinically relevant variables. After data preprocessing and encoding of binary, ordinal, and multi-category variables, statistical analysis and chi-square testing were performed to assess feature associations. Two modeling scenarios were designed: Scenario I, using 10 clinical features, and Scenario II, using 6 optimized features selected via ReliefF and mRMR algorithms. Model training and validation employed 5-fold cross-validation across nine machine-learning models (Tree, Discriminant, ELR, NB, SVM, Efficient Linear, KNN, Kernel, and Neural Network). The best-performing model was then interpreted using SHAP analysis, enabling feature-level explainability. Finally, the outcomes were clinically validated and mapped to personalized risk assessment for improved post-treatment management of DTC patients.

3. Results

3.1. Results of Statistical Analysis

The statistical analysis was conducted using ordinal logistic regression (OLR) to identify significant predictors of risk stratification in differentiated thyroid cancer. The dependent variable was Risk (ordered as Low → Intermediate → High). The full model incorporated all clinicopathological variables, and subsequent adjustments were performed by merging certain T subcategories to address class imbalance among tumor stages.
From Table 5, the omnibus likelihood ratio tests (LRT) identified T-stage, N-stage, and M-stage as the most influential factors associated with increasing Risk levels (p < 0.001 for all three). The Response to initial therapy (p ≈ 0.04), Hx Radiotherapy (p ≈ 0.027), and Age (p ≈ 0.014) also contributed significantly. Other variables, including Gender, Smoking history, Focality, Pathology, and Adenopathy, were not statistically significant (p > 0.05). The ranking of effect importance based on χ² values indicated a clear hierarchy: T > N > M > Response > Hx Radiotherapy > Age. Initial analysis revealed that T-stage contained multiple sublevels (T1a–T4b), with some categories having very small sample sizes, leading to unstable odds ratio (OR) estimates. Therefore, clinically related subgroups were merged (e.g., T4a + T4b → T4, and T1 + T2 → T1_2). After adjustment, the model maintained comparable fit (χ² = 447, p < 0.001; McFadden R² = 0.695), confirming the robustness of the predictors.
After adjusting for covariates:
  • Age (OR = 1.03, p ≈ 0.02) modestly increased the likelihood of higher risk with advancing age.
  • Hx Radiotherapy had a strong positive association with higher risk (OR ≈ 76–96, p ≈ 0.01–0.03).
  • Metastasis (M1) markedly increased risk (OR ≈ 90–113, p < 0.001).
  • N1b nodes conferred the greatest effect (OR ≈ 86–105, p < 0.001), while N1a also remained significant (OR ≈ 25–33, p < 0.001).
  • Response categories also showed graded associations: compared with Excellent, Structural Incomplete (OR ≈ 3.9, p ≈ 0.03) and Biochemical Incomplete (OR ≈ 4.8–5.7, p ≈ 0.02–0.03) were linked with elevated risk.
Non-significant variables included Gender, Smoking, Adenopathy, Pathology, and Focality (all p > 0.1).
The estimated thresholds separating Low → Intermediate (β ≈ 5.7, p < 0.001) and Intermediate → High (β ≈ 13.5, p < 0.001) were statistically significant, demonstrating distinct boundaries between risk categories.
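The thresholds and odds ratios above can be read against the usual proportional-odds form of ordinal logistic regression. The sketch below assumes that standard cumulative-logit parameterization, which the text does not spell out. The symbols theta_j (one threshold per cut point) and beta_k (one coefficient per covariate) are notation chosen here, and statistical packages differ in their sign convention.

```latex
\begin{aligned}
\operatorname{logit} P(Y \le j \mid x) &= \theta_j - \beta^{\top} x,
  \qquad j \in \{\text{Low}, \text{Intermediate}\} \\
\mathrm{OR}_k &= \exp(\beta_k) \\
% Illustration with the reported Age effect (OR = 1.03 per unit of Age),
% assuming Age is in years and enters linearly on the log-odds scale:
\mathrm{OR}_{\text{Age},\,10\ \text{years}} &= 1.03^{10} \approx 1.34
\end{aligned}
```

Under this convention a positive coefficient lowers the probability of falling at or below a given category, which is why odds ratios above 1 correspond to a shift toward higher risk. The last line shows how a per-year odds ratio that looks small compounds over a decade of age difference.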
Clinically, the dominance of T, N, and M as predictors reflects the underlying TNM staging logic, which inherently defines tumor progression and spread. However, the ordinal regression analysis quantifies their relative contribution and confirms that T-stage exerts the strongest incremental effect on risk escalation. The strong effect of Hx Radiotherapy likely reflects referral bias in higher-risk patients previously exposed to radiation. The Response variable captures postoperative biological behavior and indicates that biological aggressiveness after surgery remains a strong determinant of risk classification.

3.2. Feature Selection Using Machine Learning Approaches

To enhance model interpretability and generalizability, feature selection, predictive modeling, and model explainability were performed using machine learning workflows implemented in MATLAB’s Classification Learner, including SHAP-based analysis to quantify feature contribution. While statistical analysis (Section 3.1) identified significant variables through inferential testing, machine learning (ML)-based feature selection enables multivariate and non-linear evaluation of variable importance, which is particularly useful for complex clinical data such as thyroid cancer risk stratification.
This dual approach ensures that both statistical significance and predictive relevance are systematically assessed before model development.
Two supervised ML algorithms were employed to rank and select influential predictors:
  • ReliefF algorithm [36]—a neighborhood-based feature ranking technique that evaluates each variable’s ability to differentiate between neighboring instances belonging to different risk categories. It captures non-linear and interaction effects among variables without assuming any specific data distribution.
  • MRMR (Minimum Redundancy Maximum Relevance) [37]—an information-theoretic algorithm that identifies features exhibiting the highest relevance to the target variable (Risk) while minimizing redundancy among correlated predictors. This approach enhances model efficiency and reduces collinearity between features such as T, N, and M.
Both algorithms were applied to the encoded dataset (excluding Stage due to redundancy with TNM variables). The computed importance scores from ReliefF and MRMR were compared to identify consistent patterns and complementary insights. The resulting rankings and visualized feature importance profiles are presented in Figure 2 and Table 6, respectively.
Both machine learning algorithms identified T-stage as the most dominant predictor of thyroid cancer risk stratification. The ReliefF algorithm emphasized overall stage (T > Stage > M > Response > N > Age), focusing on neighborhood-based differentiation, whereas MRMR prioritized features with high information gain and minimal overlap (T > Hx Radiotherapy > N > Response > Focality). These complementary results confirm the clinical hierarchy of T > N > M, and highlight the influence of prior radiotherapy and postoperative response, which will be further examined through SHAP-based explainability analyses in the following section.
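To make the mRMR idea concrete, the following Python sketch ranks discrete features greedily by relevance to the target minus mean redundancy with the features already chosen. It is a simplified illustration under stated assumptions, not the routine used in the study: it uses the difference form of the criterion with plug-in mutual information, whereas toolbox implementations may use a quotient form or different estimators. The toy data and the feature names `f1`, `f1_dup`, `f2`, and `noise` are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr_rank(features, target, k=None):
    """Greedy mRMR ranking: relevance minus mean redundancy (difference form)."""
    remaining = list(features)
    relevance = {name: mutual_information(features[name], target) for name in remaining}
    selected = []
    k = len(remaining) if k is None else k
    while remaining and len(selected) < k:

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: the 3-level target is f1 + f2; f1_dup copies f1;
# noise is independent of everything else.
toy_features = {
    "f1": [0, 0, 1, 1, 0, 0, 1, 1],
    "f1_dup": [0, 0, 1, 1, 0, 0, 1, 1],
    "f2": [0, 1, 0, 1, 0, 1, 0, 1],
    "noise": [0, 0, 0, 0, 1, 1, 1, 1],
}
toy_target = [0, 1, 1, 2, 0, 1, 1, 2]

if __name__ == "__main__":
    print(mrmr_rank(toy_features, toy_target, k=2))
```

In the toy data the duplicated feature is as relevant as the original, yet it is passed over for the second slot because it adds no information beyond what has already been selected. This is the redundancy penalty that the text describes for correlated predictors such as T, N, and M.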

3.3. Machine Learning Model Development

After identifying the statistically and clinically relevant variables, machine learning (ML) model development was performed to predict the categorical risk levels (Low, Intermediate, High) of differentiated thyroid cancer. The primary goal of this phase was to construct a robust and interpretable predictive model that integrates both statistical inference and data-driven learning to improve risk stratification beyond conventional ordinal regression. Prior to training, feature selection was finalized by integrating results from the statistical analysis (Ordinal Logistic Regression) and ML-based feature ranking algorithms (ReliefF and MRMR). This hybrid selection strategy ensured that only the most informative, non-redundant, and clinically meaningful predictors were included, while variables that could cause redundancy (Stage) or post-treatment leakage (Response) were excluded. The resulting feature set represents the optimal compromise between model simplicity, interpretability, and predictive strength.
The final selected features used for model training are summarized in Table 7. These variables were used to train multiple supervised learning algorithms in subsequent steps, including logistic regression, random forest, and gradient boosting, with hyperparameter tuning and five-fold cross-validation to ensure model generalizability.
According to the integrated evidence from both statistical and machine learning analyses, three feature selection scenarios were defined for subsequent model training:
Scenario I: Extended Feature Set (Include + Optional). This configuration expands upon the core set by including additional variables that demonstrated moderate importance or complementary predictive potential in machine learning ranking. The extended feature set includes T, N, M, Age, Hx Radiotherapy, Response, Adenopathy, Pathology, and Gender.
Scenario II: Core Feature Set (Include only). This configuration contains only the variables confirmed as statistically and clinically significant in both analytical approaches. The selected features are T, N, M, Age, Hx Radiotherapy, and Response. These features represent the essential predictors for baseline risk modeling and provide the foundation for developing the main predictive framework.
Scenario III: Reduced Preoperative Feature Set (Include minus Response). This scenario excludes the postoperative Response variable to focus strictly on preoperative predictors. The remaining five features (T, N, M, Age, and Hx Radiotherapy) represent the key variables that retain strong statistical and machine-learning support. This configuration evaluates the model’s ability to perform true preoperative risk stratification using only baseline information.
To ensure methodological robustness and reduce overfitting risk, the dataset was divided into three parts: (1) a training set with 5-fold cross-validation for model learning, (2) a validation set for hyperparameter optimization, and (3) a hold-out independent test set comprising 10% of the data for unbiased performance evaluation.
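A minimal Python sketch of this kind of partitioning is shown below: a 10% hold-out set is carved off first, and the remainder is divided into five cross-validation folds. Stratifying by risk class, the random seeds, and the toy class counts are assumptions made for illustration. The separate validation subset mentioned above is not modeled because its size is not stated.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.10, seed=0):
    """Return (train_idx, test_idx), sampling the test set within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_fraction))  # at least one per class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_kfold(indices, labels, k=5, seed=0):
    """Yield (fit_idx, val_idx) pairs; each class is dealt round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        fit = sorted(i for g in range(k) if g != f for i in folds[g])
        yield fit, val


if __name__ == "__main__":
    # Hypothetical labels (0 = Low, 1 = Intermediate, 2 = High); not the study data.
    toy_labels = [0] * 20 + [1] * 10 + [2] * 10
    train, test = stratified_holdout(toy_labels)
    for fit, val in stratified_kfold(train, toy_labels):
        print(len(fit), len(val))
```

Holding out the test indices before any fold is formed keeps those cases unseen during both model fitting and hyperparameter search, which is the property the text relies on for an unbiased final estimate.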
From Table 8, reducing the feature set from ten to six variables (Scenario II) improved computational efficiency while maintaining comparable or enhanced predictive performance across most classifiers. The Neural Network demonstrated the highest performance in Scenarios I and II, with accuracy increasing from 89.47% to 92.11% and only a modest rise in macro F1 score. In Scenario III, where the postoperative Response variable was excluded, the Tree model emerged as the best-performing and most practically applicable classifier, achieving 88.12% accuracy, 92.11% precision, 92.86% recall, and 87.88% macro F1, alongside a strong AUC (~0.93). This indicates that the five strictly preoperative predictors (T, N, M, Age, and Hx Radiotherapy) retain sufficient discriminatory power for robust early risk stratification. The consistency and interpretability of the Tree model further support its suitability for real-world clinical deployment. Overall, the findings show that progressive feature reduction enhances parsimony and preserves predictive accuracy, with the Tree model offering the most effective balance of performance, simplicity, and clinical usability in the preoperative setting.
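For readers less familiar with the multi-class metrics reported above, the sketch below derives accuracy and macro-averaged precision, recall, and F1 from a three-class confusion matrix. It assumes macro F1 is the unweighted mean of the per-class F1 scores, which is one common definition; the text does not state which variant was used. The confusion matrix is invented for illustration and is not taken from Table 8.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = count of cases with true class i predicted as class j."""
    k = len(confusion)
    metrics = []
    for c in range(k):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # missed cases of class c
        fp = sum(confusion[r][c] for r in range(k)) - tp     # other classes called c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def summarize(confusion):
    """Accuracy plus unweighted (macro) averages of the per-class metrics."""
    metrics = per_class_metrics(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(len(confusion)))
    macro = [sum(m[i] for m in metrics) / len(metrics) for i in range(3)]
    return {
        "accuracy": correct / total,
        "macro_precision": macro[0],
        "macro_recall": macro[1],
        "macro_f1": macro[2],
    }


if __name__ == "__main__":
    # Hypothetical matrix; rows/columns ordered Low, Intermediate, High.
    example = [[4, 1, 0], [1, 3, 0], [0, 0, 1]]
    print(summarize(example))
```

Macro averaging gives the small High-risk class the same weight as the large Low-risk class, so it can move differently from accuracy when the classes are imbalanced.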
The optimizable neural network model (Figure 3) demonstrated strong classification performance, with minimal misclassification and high discriminative ability (AUCs: 0.85–0.96; micro-AUC: 0.94), effectively separating all three DTC risk groups. Similarly, the optimizable tree model in Scenario III (Figure 4) showed consistently high performance using only preoperative predictors, with class-wise AUCs of 0.93–0.96 and accurate identification across all classes. Together, these results confirm the robustness of both models, with the tree model offering a highly effective and clinically practical solution in the reduced preoperative feature setting.
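Class-wise and micro-averaged AUCs of the kind quoted above can be computed as sketched below. The code treats each class as a one-vs-rest problem and uses the rank interpretation of AUC (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one). Whether the study's software pooled classes in exactly this way is an assumption, and the predicted probabilities in the example are hypothetical.

```python
def binary_auc(scores, is_positive):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_rows, labels, n_classes=3):
    """One AUC per class: that class's probability scored against all other classes."""
    return [
        binary_auc([row[c] for row in prob_rows], [y == c for y in labels])
        for c in range(n_classes)
    ]


def micro_auc(prob_rows, labels, n_classes=3):
    """Micro-average: pool every (case, class) pair into a single binary problem."""
    scores = [row[c] for row in prob_rows for c in range(n_classes)]
    flags = [y == c for y in labels for c in range(n_classes)]
    return binary_auc(scores, flags)


if __name__ == "__main__":
    # Hypothetical predicted probabilities for (Low, Intermediate, High).
    probs = [
        [0.8, 0.1, 0.1],
        [0.6, 0.3, 0.1],
        [0.2, 0.7, 0.1],
        [0.5, 0.3, 0.2],
        [0.1, 0.2, 0.7],
    ]
    truth = [0, 0, 1, 1, 2]
    print(one_vs_rest_auc(probs, truth), micro_auc(probs, truth))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a hold-out set of a few dozen patients but would be replaced by a sort-based computation on larger data.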

3.4. Model Explainability Using SHAP Analysis

After identifying the neural network as the best-performing model in Scenario II (Table 8), SHAP (SHapley Additive exPlanations) analysis was employed to interpret how individual predictors contributed to the model’s output probabilities across the three thyroid cancer risk categories (Low, Intermediate, and High). The SHAP framework provides a unified measure of feature contribution by quantifying each variable’s marginal impact on model predictions, thereby enhancing model transparency and clinical interpretability.
By computing the mean absolute Shapley values, we can determine which features exert the greatest influence on the neural network’s decision boundary. This approach allows clinicians and researchers to confirm whether the model’s learned relationships are consistent with established clinical reasoning and pathological patterns observed in differentiated thyroid cancer.
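The ranking by mean absolute Shapley value can be illustrated with an exact computation. With only six input features there are 2^6 = 64 feature subsets, so brute-force enumeration is feasible. The sketch below is not the routine used in the study: it assumes an interventional value function (features outside a subset are filled in from background rows), and the linear toy model and all names are hypothetical. Library implementations may approximate these values differently.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values of predict(x) relative to a background sample."""
    n = len(x)

    def coalition_value(subset):
        # Mean prediction when features in `subset` come from x and the
        # remaining features come from each background row in turn.
        total = 0.0
        for row in background:
            mixed = [x[j] if j in subset else row[j] for j in range(n)]
            total += predict(mixed)
        return total / len(background)

    values = {
        frozenset(s): coalition_value(frozenset(s))
        for size in range(n + 1)
        for s in combinations(range(n), size)
    }
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                s = frozenset(s)
                phi[i] += weight * (values[s | {i}] - values[s])
    return phi


def mean_abs_importance(predict, rows, background):
    """Global importance: mean absolute Shapley value per feature across rows."""
    per_case = [shapley_values(predict, x, background) for x in rows]
    return [sum(abs(p[i]) for p in per_case) / len(per_case) for i in range(len(rows[0]))]


if __name__ == "__main__":
    # Hypothetical stand-in for one class score of a trained model.
    def toy_score(v):
        return 2.0 * v[0] + 1.0 * v[1] + 0.0 * v[2]

    background_rows = [[0, 0, 0], [2, 2, 2]]
    print(shapley_values(toy_score, [3, 0, 5], background_rows))
    print(mean_abs_importance(toy_score, [[3, 0, 5], [1, 4, 1]], background_rows))
```

For a three-class model the same computation would be repeated once per class score, giving class-specific attributions. Taking absolute values before averaging prevents positive and negative contributions from different patients from cancelling in the global ranking.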
Figure 5 demonstrates that the neural network model in Scenario II relied primarily on N and T, with N showing the highest Shapley values and representing the dominant driver of prediction, followed by substantial contributions from T and the postoperative Response variable. Age had a moderate effect, while M and Hx Radiotherapy added minimal incremental value once nodal and local tumor burden were accounted for.
Figure 6 shows a similar pattern in Scenario III (excluding Response), where the Tree model again identified N as the strongest predictor, followed by T and Age, with M and Hx Radiotherapy contributing only marginally. This consistency across models reinforces the central role of nodal and tumor extent parameters in preoperative risk discrimination.
The class-wise SHAP plots from the neural network (Figure 7) and tree model (Figure 8) consistently reveal clear and clinically coherent patterns of feature influence across the three risk classes.
  • Class 0 (Low Risk). Low-risk predictions are characterized by low T and N values with small or negative SHAP contributions, indicating minimal tumor extension and absence of nodal metastasis. Response and Age show minimal influence, aligning with the favorable biologic profile of this group.
  • Class 1 (Intermediate Risk). Intermediate-risk cases exhibit moderate positive SHAP shifts primarily from N and T, showing that even limited regional spread or modest tumor growth increases risk probability. Response provides additional separation: non-excellent responses push predictions upward, while excellent responses shift them downward.
  • Class 2 (High Risk). High-risk predictions are dominated by high N and T values with strong positive SHAP effects, reflecting aggressive tumor behavior and regional metastasis. Response again plays a major role, with non-excellent outcomes yielding the highest SHAP values. M contributes intermittently in line with metastatic potential, while Hx Radiotherapy produces minimal impact in both models.
Across both modeling approaches, T and N emerge as the primary determinants of class separation, with Response providing dynamic post-treatment refinement and Age and M contributing modestly. These SHAP patterns affirm that the models differentiate risk categories in a clinically meaningful and guideline-consistent manner.
As shown in Figures 7–10, the class-wise SHAP box summary plots consistently demonstrate distinctive feature influence patterns across thyroid cancer risk levels in both the neural network and tree models.
  • Class 0 (Low Risk). SHAP values for most predictors are centered near zero with minimal spread, indicating limited contribution to risk elevation. Small variation in T and N confirms the generally indolent phenotype: small tumors without lymph node involvement. Low and stable SHAP values for Response further reflect favorable postoperative outcomes.
  • Class 1 (Intermediate Risk). Wider interquartile ranges for N and T highlight their stronger and more variable impact in this transitional group. N remains the dominant driver, while T provides additional local invasion signal. Moderate contributions from Response and Age suggest their roles in refining borderline risk assignments, especially in differentiating patients with good versus suboptimal treatment response.
  • Class 2 (High Risk). Positive SHAP distributions for N and T are the most pronounced, confirming extensive nodal burden and aggressive tumor growth as key determinants of high-risk status. Response exhibits consistently positive contributions, indicating that incomplete or poor treatment response strongly increases recurrence probability. Occasional influence from M aligns with advanced metastatic cases. Hx Radiotherapy remains negligible, supporting its role as a secondary indicator in both models.
Across all visualizations, T and N are the most decisive predictors of risk escalation, while Response acts as an important behavioral marker post-treatment. These class-specific SHAP patterns reinforce the clinical reliability and interpretability of the proposed models in stratifying differentiated thyroid cancer risk.

4. Discussion

This study developed an interpretable machine-learning framework for risk stratification in differentiated thyroid cancer (DTC), integrating clinical, pathological, and postoperative response variables. By combining neural network modeling with SHAP-based explainability, our approach provides both predictive accuracy and clinical interpretability—bridging a critical gap between statistical prediction and individualized patient management.
The model identified six key predictors (T, N, Response, Age, M, and Hx Radiotherapy) as the most influential determinants of recurrence risk. Among these, N and T exhibited the strongest positive SHAP contributions to high-risk classification, consistent with their established roles in tumor burden and local invasion. Cervical lymph node metastasis (N1b) remains the most important prognostic determinant in papillary thyroid carcinoma, directly influencing the extent of surgery, the need for adjuvant radioactive iodine (RAI), and the intensity of postoperative surveillance. Similarly, T classification reflects extrathyroidal extension and tumor size, aligning closely with the ATA and AJCC risk frameworks.
Including Response as a dynamic postoperative variable substantially enhanced model precision. In clinical practice, treatment response integrates both biochemical evidence (e.g., thyroglobulin trends) and structural evidence from imaging; it therefore serves as a real-time reflection of disease behavior beyond static pathology. The strong model contribution of Response indicates that incorporating longitudinal follow-up data can recalibrate the initial risk assessment, echoing the principles of dynamic risk stratification advocated by Tuttle et al. [19].
Hx Radiotherapy retained a positive SHAP influence for higher-risk patients. This likely reflects referral bias: individuals previously exposed to head-and-neck radiation are more frequently managed in tertiary centers and often present with more aggressive disease phenotypes. This observation emphasizes that background treatment history remains an essential modifier of risk interpretation.
The “Risk” labels used in this study are derived from the evidence-supported 2015 and 2025 ATA Risk of Recurrence stratification systems for differentiated thyroid carcinoma (DTC), rather than subjective expert judgment. These ATA frameworks integrate objectively validated clinicopathologic and molecular predictors, including tumor focality, extrathyroidal extension, vascular invasion (angioinvasion), lymph node burden (size, number, and lymph node ratio (LNR)), AJCC staging, and treatment response variables. Large-scale cohort studies and meta-analyses have consistently demonstrated the predictive validity of these risk categories; for example, the 2015 ATA system documented recurrence rates of approximately 1.5% (low risk), 5.4% (intermediate risk), and 25% (high risk), with similar trends observed in T1a PTC and other DTC subtypes. These findings reinforce that the risk strata employed in this work reflect clinically validated constructs rather than arbitrary or subjective classification. Nevertheless, because ATA-based categories represent proxy indicators of recurrence rather than longitudinally observed outcomes, some degree of latent label noise may still be present, particularly in retrospective datasets. To mitigate this, the present study employed rigorous internal validation, including five-fold cross-validation, a dedicated validation subset, and a 10% independent hold-out test set to enhance generalization and minimize overfitting. Future work incorporating datasets with real longitudinal recurrence outcomes will further improve calibration and strengthen clinical applicability.
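The internal validation scheme described above (a 10% hold-out test set, with five-fold cross-validation on the remaining data) can be sketched roughly as follows. This is an illustrative Python sketch using only the standard library, not the authors' MATLAB Classification Learner workflow. The stratification by class, the random seed, the function names, and the synthetic labels are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_frac=0.10, seed=0):
    """Split indices into (train, test), keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(test_frac * len(idxs)))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

def stratified_kfold(indices, labels, k=5, seed=0):
    """Yield (fit, validate) index lists for k folds over `indices`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class round-robin so every fold sees every class.
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    for i in range(k):
        validate = sorted(folds[i])
        fit = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield fit, validate

# Synthetic three-class labels (made up for illustration; not the study data).
labels = ["Low"] * 60 + ["Intermediate"] * 30 + ["High"] * 10
train_idx, test_idx = stratified_holdout(labels)
cv_folds = list(stratified_kfold(train_idx, labels))
```

The hold-out indices are set aside before any fold is formed, so the test set never influences model selection during cross-validation.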
Furthermore, in Scenario III—where the postoperative Response variable was removed to reflect a strictly preoperative prediction setting—the Optimizable Tree model continued to demonstrate excellent performance, with accuracy and AUC values comparable to those observed in Scenario II. This finding indicates that robust recurrence-risk prediction in differentiated thyroid cancer can still be achieved without relying on postoperative data. Importantly, by using only preoperative clinical features (T, N, M, Age, and Hx Radiotherapy), the model enables actionable risk estimation before initial treatment decisions are made, supporting better surgical planning, early evaluation of adjuvant therapy needs, and individualized patient counseling. These results reinforce the model’s clinical utility and align with the goals of precision oncology, promoting proactive and data-driven personalized care even in the absence of longitudinal follow-up information.
From a clinical perspective, the explainable machine-learning framework complements traditional staging systems by offering individualized probability estimates rather than categorical labels. For example, patients with intermediate ATA risk but high SHAP-derived contributions from Response or N may warrant closer monitoring or early RAI ablation, even if they fall within a conservative management protocol. Conversely, those with low SHAP scores across all predictors may safely undergo de-escalated surveillance, thereby reducing unnecessary imaging and cost. The interpretability of SHAP outputs provides a major advantage for multidisciplinary tumor boards, enabling endocrinologists, surgeons, and nuclear medicine specialists to visualize how each factor contributes to recurrence probability. This transparency enhances clinical trust and supports shared decision-making, aligning computational outputs with pathophysiological reasoning. Moreover, explainable AI may reduce interobserver variability by offering a reproducible, data-driven supplement to human interpretation—particularly valuable in borderline cases such as minimal vascular invasion or focal extrathyroidal extension, where histopathologic consensus is often limited.
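To make concrete how a SHAP value attributes part of a prediction to each feature, the sketch below computes exact Shapley values for a toy three-feature score. Everything in it is hypothetical: the scoring function, its weights, and the reference patient are invented for illustration and are not the authors' neural network or tree model. Practical SHAP implementations approximate this computation rather than enumerating every coalition.

```python
from itertools import combinations
from math import factorial

FEATURES = ["T", "N", "Response"]

def toy_risk_score(x):
    """Hypothetical risk score with an N-by-Response interaction (illustrative only)."""
    return 0.2 * x["T"] + 0.3 * x["N"] + 0.1 * x["N"] * x["Response"]

def shapley_values(model, instance, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(FEATURES)

    def value(coalition):
        x = {f: (instance[f] if f in coalition else baseline[f]) for f in FEATURES}
        return model(x)

    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Weight of a coalition of this size in the Shapley average.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += weight * (value(set(subset) | {f}) - value(set(subset)))
        phi[f] = total
    return phi

# Encoded values follow Tables 2 and 4: T3a = 3, N1b = 2, Indeterminate = 1.
patient = {"T": 3, "N": 2, "Response": 1}
# Reference patient (an assumption): T1a, N0, Excellent response.
baseline = {"T": 0, "N": 0, "Response": 0}
phi = shapley_values(toy_risk_score, patient, baseline)
```

The per-feature values sum to the difference between the patient's score and the reference score, which is the additivity property that allows a tumor board to read each contribution as a share of the predicted risk.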
In the context of personalized medicine, the integration of explainable ML models allows risk prediction to evolve from population-based guidelines toward individualized treatment trajectories. Rather than classifying patients as low, intermediate, or high risk in static categories, SHAP-informed models generate a continuous risk spectrum that can adapt as new clinical or biochemical information emerges. This facilitates truly personalized surveillance—where follow-up frequency, imaging modality, and therapeutic intensity are dynamically matched to each patient’s evolving risk profile. Furthermore, the interpretability of the SHAP framework offers a mechanism for model validation across diverse patient cohorts and institutions. By comparing SHAP feature rankings in external datasets, researchers can assess whether predictive drivers remain stable or vary across populations—an essential step for equitable implementation of AI tools in precision oncology.
The proposed model underscores a paradigm shift from descriptive staging toward explainable, adaptive modeling that aligns with the principles of learning health systems. In clinical practice, such tools could be embedded within electronic medical records to provide real-time, interpretable risk updates as patient data accumulate. Future studies should aim to integrate molecular biomarkers (e.g., BRAF, RET mutations), radiomic features, and temporal biochemical trends to refine prediction granularity further. Ultimately, explainable AI in thyroid oncology holds promise not only for improving recurrence risk prediction but also for reinforcing clinical reasoning, promoting transparency, and advancing personalized, data-driven care for patients with differentiated thyroid cancer.
Although the proposed explainable machine-learning framework demonstrates strong predictive power and clear clinical interpretability, several limitations should be acknowledged. First, the dataset used in this study was derived from a single public source (UCI Differentiated Thyroid Cancer Recurrence dataset) with a relatively limited sample size. While the inclusion of key clinicopathologic features provides a robust foundation, certain parameters such as biochemical markers (e.g., thyroglobulin, anti-Tg antibodies), detailed histopathological subvariants, and molecular markers (e.g., BRAF, RET, RAS) were unavailable. These molecular determinants are increasingly recognized as major contributors to disease aggressiveness and may further enhance the precision of recurrence prediction if integrated into future models.
Second, the “Risk” label is a guideline-derived category that depends on pathological interpretation rather than a directly measurable endpoint. This dependency introduces potential interobserver variability, particularly in the assessment of vascular invasion, extrathyroidal extension, and capsular infiltration. As a result, some degree of label uncertainty may propagate into the model, potentially affecting calibration. Incorporating probabilistic or fuzzy labeling frameworks may mitigate this issue by quantifying uncertainty at the input level.
Third, the present analysis employed a retrospective dataset without external validation. Although internal validation demonstrated high accuracy and AUC, prospective multicenter validation using real-world hospital registries is required to confirm generalizability across ethnicities, clinical settings, and diagnostic equipment. Cross-institutional collaborations and federated learning approaches could address this limitation while maintaining patient data privacy.
Future work should prioritize external validation using real-world hospital datasets to ensure model generalizability and strengthen clinical credibility. This study provides a foundation for developing an effective data acquisition strategy to systematically collect high-quality clinical features required for model deployment. Establishing robust external validation pipelines will support evidence-based integration of predictive models into routine practice and help guarantee that performance remains reliable across diverse patient populations and clinical settings.
Ultimately, explainable machine-learning tools like SHAP can transform endocrine oncology from a static, guideline-based discipline into a dynamic, learning system that continuously improves as more patient data become available. Such integration between artificial intelligence and clinical reasoning represents a critical step toward transparent, equitable, and patient-centered precision medicine in differentiated thyroid cancer.
Despite these limitations, this work provides several notable contributions:
(1) It is among the first studies to systematically compare predictive performance under both postoperative and purely preoperative scenarios in DTC risk stratification, demonstrating that accurate prediction can be achieved even without Response data.
(2) The integration of SHAP-based explainability offers interpretable clinical insights that align with standard risk frameworks (ATA/AJCC), supporting transparent AI-assisted decision-making.
(3) The study establishes a practical modeling pipeline using widely accessible software (MATLAB Classification Learner, version R2025a), enabling straightforward clinical translation and reproducibility in healthcare environments with limited computational resources.
(4) The findings help inform future data collection strategies for external validation and multimodal model enhancement, laying groundwork for deployment within real-world clinical workflows.
In this context, the emergence of precision oncology provides an important conceptual foundation for integrating individualized clinical, pathological, and treatment-related information into risk-adapted management strategies for DTC. Recent advances in artificial intelligence further support this paradigm by enabling data-driven, patient-specific prediction models that can complement traditional staging frameworks and enhance treatment precision. Accordingly, explainable machine-learning approaches—such as the framework proposed in this study—may serve as a bridge between statistical prediction and personalized clinical decision-making, reinforcing the shift toward more refined and patient-centered risk stratification.

5. Conclusions

This study demonstrates the clinical utility of two complementary explainable machine-learning approaches for recurrence risk stratification in differentiated thyroid cancer (DTC). The interpretable neural network model, which combined baseline clinicopathological indicators (T, N, M, Age, Hx Radiotherapy) with the dynamic postoperative Response variable, achieved the highest predictive performance, closely aligning with expert clinical reasoning. In parallel, the optimizable tree model, designed without the Response variable to reflect a purely preoperative prediction scenario, retained strong classification capability, confirming that reliable risk estimation can still be achieved prior to postoperative follow-up. Together, these models highlight flexible and clinically coherent strategies for precision risk assessment across the entire continuum of care.
Clinically, this approach bridges the gap between traditional risk classification systems (ATA, AJCC) and modern precision oncology. Rather than relying on categorical staging alone, SHAP-based interpretation enables patient-level visualization of how each clinical variable contributes to recurrence probability—facilitating informed, personalized management decisions. Such transparency supports multidisciplinary decision-making, promotes physician trust in AI systems, and allows continuous refinement of follow-up protocols according to evolving risk patterns.
From a broader perspective, this framework illustrates the potential of explainable artificial intelligence (XAI) to transform endocrine oncology into a learning health system, where every clinical encounter contributes to iterative model improvement. Future extensions that incorporate molecular, radiomic, and longitudinal biochemical data may further enhance individualized prediction and treatment precision. Ultimately, integrating interpretable ML tools into clinical workflows represents a decisive step toward data-driven, transparent, and personalized care in thyroid cancer management.

Author Contributions

Conceptualization, M.K.; methodology and software, W.C.; validation, P.S., W.C. and M.K.; formal analysis, W.C.; data curation, M.K.; writing—original draft preparation, W.C.; writing—review and editing, P.S. and M.K.; supervision, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the University of Phayao and the Thailand Science Research and Innovation Fund (Fundamental Fund 2026, Grant No. 2253/2568), and by the 2025 revenue budget of the School of Medicine, University of Phayao (Grant No. MD68-21).

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki, the Belmont Report, the CIOMS Guidelines, and the International Conference on Harmonization Good Clinical Practice (ICH-GCP) guidelines, and with approval from the Human Research Ethics Committee of University of Phayao on Health Sciences and Sciences and Technology (Institutional Review Board (IRB) approval, IRB Number: HREC-UP-HSST 1.1/003/69, Approval Date: 15 October 2025).

Informed Consent Statement

Patient consent was waived because the data used in this study were obtained from an open-access, fully anonymized dataset—the Differentiated Thyroid Cancer Recurrence Dataset—available through the UCI Machine Learning Repository (DOI: https://doi.org/10.24432/C5632J). The dataset contains no identifiable personal information; therefore, individual informed consent was not required. This study was conducted in accordance with the Declaration of Helsinki, the Belmont Report, the CIOMS Guidelines, and the International Conference on Harmonization–Good Clinical Practice (ICH-GCP). The study protocol was reviewed and approved by the Ethics Committee and Institutional Review Board of Phayao Hospital, Thailand (IRB Approval Number: HREC-UP-HSST 1.1/003/69), under the Exempt review category.

Data Availability Statement

The Differentiated Thyroid Cancer Recurrence dataset is available from the UCI Machine Learning Repository (https://doi.org/10.24432/C5632J).

Acknowledgments

P. Suksorn and M. Khwanmuang would like to thank the revenue budget in 2025, School of Medicine, University of Phayao (Grant No. MD68-21). W. Cholamjiak would like to thank University of Phayao and Thailand Science Research and Innovation Fund (Fundamental Fund 2026, Grant No. 2253/2568). During the preparation of this manuscript, the authors used ChatGPT (GPT-5, OpenAI, 2025) for language refinement, grammar correction, and improvement of academic phrasing in the Abstract, Methods, and Discussion sections. The tool was used solely to enhance clarity and readability; it did not generate scientific ideas, perform data analysis, or contribute to the interpretation of results. All scientific content, clinical interpretation, and conclusions were developed, verified, and approved by the authors, who take full responsibility for the integrity and accuracy of the work. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shokoohi, A.; Berthelet, E.; Gill, S.; Prisman, E.; Sexsmith, G.; Tran, E.; White, A.; Wiseman, S.M.; Wu, J.; Ho, C. Treatment for Recurrent Differentiated Thyroid Cancer: A Canadian Population-Based Experience. Cureus 2020, 12, e7122.
  2. Coca-Pelaz, A.; Rodrigo, J.P.; Shah, J.P.; Nixon, I.J.; Hartl, D.M.; Robbins, K.T.; Kowalski, L.P.; Mäkitie, A.A.; Hamoir, M.; López, F.; et al. Recurrent Differentiated Thyroid Cancer: The Current Treatment Options. Cancers 2023, 15, 2692.
  3. Cabanillas, M.E.; McFadden, D.G.; Durante, C. Thyroid Cancer. Lancet 2016, 388, 2783–2795.
  4. Durante, C.; Haddy, N.; Baudin, E.; Leboulleux, S.; Hartl, D.; Travagli, J.P.; Caillou, B.; Ricard, M.; Lumbroso, J.D.; De Vathaire, F.; et al. Long-Term Outcome of 444 Patients with Papillary Thyroid Cancer. J. Clin. Endocrinol. Metab. 2006, 91, 2892–2899.
  5. Haugen, B.R. 2015 American Thyroid Association Management Guidelines for Adult Patients with Thyroid Nodules and Differentiated Thyroid Cancer. Thyroid 2017, 123, 372–381.
  6. Tuttle, R.M.; Tala, H.; Shah, J.; Leboeuf, R.; Ghossein, R.; Gonen, M.; Brokhin, M.; Omry, G.; Fagin, J.A.; Shaha, A. Estimating risk of recurrence in differentiated thyroid cancer after total thyroidectomy and radioactive iodine remnant ablation: Using response to therapy variables to modify the initial risk estimates predicted by the new American Thyroid Association staging system. Thyroid 2010, 20, 1341–1349.
  7. Lloyd, R.V.; Osamura, R.Y.; Klöppel, G.; Rosai, J. (Eds.) WHO Classification of Tumours of Endocrine Organs, 4th ed.; IARC: Lyon, France, 2017.
  8. Brignardello, E.; Gallo, M.; Baldi, I.; Palestini, N.; Piovesan, A.; Grossi, E.; Ciccone, G.; Boccuzzi, G. Anaplastic thyroid carcinoma: Clinical outcome of 30 consecutive patients referred to a single institution in the past 5 years. Eur. J. Endocrinol. 2007, 156, 425–430.
  9. Kitahara, C.M.; Schneider, A.B. Epidemiology of thyroid cancer. Cancer Epidemiol. Biomark. Prev. 2022, 31, 1284–1297.
  10. Sasson, M.; Kay-Rivest, E.; Shoukrun, R.; Florea, A.; Hier, M.; Forest, V.I.; Mlynarek, A.; Payne, R.J. The T4/T3 quotient as a risk factor for differentiated thyroid cancer: A case-control study. J. Otolaryngol. Head Neck Surg. 2017, 46, 28.
  11. Rogusz, A.E. Lived Experience of People with Thyroid Cancer and Factors That Affect It: A Phenomenology Study. Ph.D. Thesis, University of Northumbria at Newcastle, Newcastle upon Tyne, UK, 2022.
  12. Randolph, G.W.; Duh, Q.Y.; Heller, K.S.; LiVolsi, V.A.; Mandel, S.J.; Steward, D.L.; Tufano, R.P.; Tuttle, R.M.; American Thyroid Association Surgical Affairs Committee’s Taskforce on Thyroid Cancer Nodal Surgery. The prognostic significance of nodal metastases from papillary thyroid carcinoma can be stratified based on the size and number of metastatic lymph nodes, as well as the presence of extranodal extension. Thyroid 2012, 22, 1144–1152.
  13. Mete, O.; Asa, S.L. Pathological definition and clinical significance of vascular invasion in thyroid carcinomas of follicular epithelial derivation. Mod. Pathol. 2011, 24, 1545–1552.
  14. Bongiovanni, M.; Mazzucchelli, L.; Giovanella, L.; Frattini, M.; Pusztaszeri, M. Well-differentiated follicular patterned tumors of the thyroid with high-grade features can metastasize in the absence of capsular or vascular invasion: Report of a case. Int. J. Surg. Pathol. 2014, 22, 749–756.
  15. Yin, D.T.; Yu, K.; Lu, R.Q.; Li, X.; Xu, J.; Lei, M. Prognostic impact of minimal extrathyroidal extension in papillary thyroid carcinoma. Medicine 2016, 95, e5794.
  16. Ferrari, S.M.; Fallahi, P.; Politti, U.; Materazzi, G.; Baldini, E.; Ulisse, S.; Miccoli, P.; Antonelli, A. Molecular Targeted Therapies of Aggressive Thyroid Cancer. Front. Endocrinol. 2015, 6, 176.
  17. Li, Y.; Tang, Z.; Ren, A.; Tian, G.; Zhang, J.; Wang, Y.; Liu, J.; Ming, J. A Machine Learning-Based Model for Predicting Recurrence in Intermediate- and High-Risk Differentiated Thyroid Cancer: Insights from a Retrospective Single-Center Study of 2388 Patients. Front. Endocrinol. 2025, 16, 1552479.
  18. Byrd, D.R.; Brookland, R.K.; Washington, M.K.; Gershenwald, J.E.; Compton, C.C.; Hess, K.R.; Meyer, L.R.; Amin, M.B.; Edge, S.B.; Greene, F.L. (Eds.) AJCC Cancer Staging Manual, 8th ed.; Springer: New York, NY, USA, 2017.
  19. Tuttle, R.M.; Alzahrani, A.S. Risk stratification in differentiated thyroid cancer: From detection to final follow-up. J. Clin. Endocrinol. Metab. 2019, 104, 4087–4100.
  20. Tuttle, R.M. Risk-adapted management of thyroid cancer. Endocr. Pract. 2008, 14, 764–774.
  21. Momesso, D.P.; Tuttle, R.M. Update on differentiated thyroid cancer staging. Endocrinol. Metab. Clin. N. Am. 2014, 43, 401–421.
  22. Sobrinho-Simões, M.; Magalhães, J.; Fonseca, E.; Amendoeira, I. Diagnostic pitfalls of thyroid pathology. Curr. Diagn. Pathol. 2005, 11, 52–59.
  23. Lamartina, L.; Handkiewicz-Junak, D. Follow-up of low-risk thyroid cancer patients: Can we stop follow-up after 5 years of complete remission? Eur. J. Endocrinol. 2020, 182, D1–D16.
  24. Lixandru-Petre, I.O.; Dima, A.; Musat, M.; Dascalu, M.; Gradisteanu Pircalabioru, G.; Iliescu, F.S.; Iliescu, C. Machine learning for thyroid cancer detection, presence of metastasis, and recurrence predictions—A scoping review. Cancers 2025, 17, 1308.
  25. Rudin, C. Stop explaining black box machine learning models for high-stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215.
  26. Topol, E.J. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56.
  27. Wang, H.; Zhang, C.; Li, Q.; Tian, T.; Huang, R.; Qiu, J.; Tian, R. Development and validation of prediction models for papillary thyroid cancer structural recurrence using machine learning approaches. BMC Cancer 2024, 24, 427.
  28. Zhan, Z.; Chen, B.; Cheng, H.; Xu, S.; Huang, C.; Zhou, S.; Wu, Y.; Liu, J.; Zhang, L. Identification of prognostic signatures in remnant gastric cancer through an interpretable risk model based on machine learning: A multicenter cohort study. BMC Cancer 2024, 24, 547.
  29. Bai, Z.; Chang, L.; Yu, R.; Li, X.; Wei, X.; Yu, M.; Liu, X.; Zhang, Z. Thyroid nodules risk stratification through deep learning based on ultrasound images. Med. Phys. 2020, 47, 6355–6365.
  30. Chattopadhyay, S. Towards predicting recurrence risk of differentiated thyroid cancer with a hybrid machine learning model. Medinformatics 2024, 1–9.
  31. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  32. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
  33. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.
  34. Lundberg, S.M.; Nair, B.; Vavilala, M.S.; Horibe, M.; Eisses, M.J.; Adams, T.; Liston, D.E.; Low, D.K.-W.; Newman, S.-F.; Kim, J.; et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat. Biomed. Eng. 2018, 2, 749–760.
  35. Borzooei, S.; Tarokhian, A. Differentiated Thyroid Cancer Recurrence [Dataset]. UCI Machine Learning Repository, 2023. Available online: https://archive.ics.uci.edu/dataset/915/differentiated+thyroid+cancer+recurrence (accessed on 18 October 2025).
  36. Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the European Conference on Machine Learning; Springer: Berlin/Heidelberg, Germany, 1994; pp. 171–182.
  37. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
Figure 1. Workflow of the proposed machine-learning and explainable AI framework for risk stratification in differentiated thyroid cancer.
Figure 2. Comparison of feature importance between ReliefF and MRMR algorithms.
Figure 3. Confusion matrix (a) and ROC curves (b) of the optimizable neural network model, demonstrating strong class-wise agreement and high discriminative performance (AUCs: 0.85, 0.94, 0.96; micro-AUC: 0.94).
Figure 4. Confusion matrix (a) and ROC curves (b) of the optimizable tree model (Scenario III), showing strong classification across all three classes and high discriminative performance (AUC: 0.93–0.96).
Figure 5. SHAP summary plot of the optimizable neural network model (Scenario II).
Figure 6. SHAP summary plot of the optimizable tree model (Scenario III).
Figure 7. Class-wise SHAP summary plots for the optimizable neural network model under Scenario II, illustrating feature contributions for (a) Low-risk, (b) Intermediate-risk, and (c) High-risk classes. Each point represents an individual patient, with color indicating normalized feature values and horizontal position reflecting the SHAP impact on class-specific predictions.
Figure 8. Class-wise SHAP summary plots for the optimizable tree model (Scenario III), illustrating feature contributions for (a) Low-risk, (b) Intermediate-risk, and (c) High-risk classes. Each point indicates an individual sample, with color denoting scaled feature values and horizontal position representing the SHAP impact on the class-specific prediction.
Figure 9. Boxplot SHAP summaries for the optimizable neural network model across (a) Low-, (b) Intermediate-, and (c) High-risk classes, showing the distribution and direction of feature contributions for the six key predictors. Positive SHAP values indicate increased likelihood of the respective class, while negative values reflect decreased class probability.
Figure 10. Class-wise boxplot SHAP summaries for the optimizable tree model (Scenario III), illustrating feature impacts for (a) Low-risk, (b) Intermediate-risk, and (c) High-risk classes. The boxplots show the distribution and direction of Shapley values for each predictor, where positive values increase the probability of the corresponding class and negative values decrease it.
Table 1. Description of variables used in the Differentiated Thyroid Cancer dataset.
| No. | Variable Name | Type | Description |
|---|---|---|---|
| 1 | Age | Numeric | Age of patient at diagnosis |
| 2 | Gender | Categorical | Sex of patient |
| 3 | Hx Smoking | Categorical | History of smoking |
| 4 | Hx Radiotherapy | Categorical | Prior radiotherapy exposure |
| 5 | Thyroid Function | Categorical | Functional thyroid status |
| 6 | Physical Examination | Categorical | Findings on clinical exam |
| 7 | Adenopathy | Categorical | Presence of lymph node enlargement |
| 8 | Pathology | Categorical | Histologic subtype |
| 9 | Focality | Categorical | Unifocal vs. multifocal lesion |
| 10 | Response | Categorical | Postoperative response to initial therapy |
| 11 | T | Ordinal | Tumor size/extent (AJCC T stage) |
| 12 | N | Ordinal | Regional lymph-node involvement |
| 13 | M | Ordinal | Distant metastasis |
Table 2. Encoding scheme for ordinal clinical features and target variable.
| Feature | Original Value | Encoded Value |
|---|---|---|
| T | T1a | 0 |
| T | T1b | 1 |
| T | T2 | 2 |
| T | T3a | 3 |
| T | T3b | 4 |
| T | T4a | 5 |
| T | T4b | 6 |
| N | N0 | 0 |
| N | N1a | 1 |
| N | N1b | 2 |
| M | M0 | 0 |
| M | M1 | 1 |
Table 3. Encoding scheme for categorical clinical features.
| Feature | Original Value | Encoded Value |
|---|---|---|
| Gender | F | 0 |
| Gender | M | 1 |
| Hx Smoking | No | 0 |
| Hx Smoking | Yes | 1 |
| Hx Radiotherapy | No | 0 |
| Hx Radiotherapy | Yes | 1 |
| Focality | Multi-Focal | 0 |
| Focality | Uni-Focal | 1 |
Table 4. Encoding scheme for multi-category clinical variables.
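| Feature | Original Value | Encoded Value |
|---|---|---|
| Adenopathy | No | 0 |
| Adenopathy | Left | 1 |
| Adenopathy | Right | 1 |
| Adenopathy | Posterior | 2 |
| Adenopathy | Bilateral | 3 |
| Adenopathy | Extensive | 4 |
| Pathology | Micropapillary | 0 |
| Pathology | Papillary | 1 |
| Pathology | Follicular | 2 |
| Pathology | Hurthel cell | 3 |
| Response | Excellent | 0 |
| Response | Indeterminate | 1 |
| Response | Biochemical Incomplete | 2 |
| Response | Structural Incomplete | 3 |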
Note: (i) Left and Right are both mapped to 1 following the provided scheme. This collapses laterality and may attenuate discriminatory signal related to side-specific disease. We retained the given scheme to match the clinical coding in the source workflow; if laterality is clinically relevant, a revised map (e.g., Left = 1, Right = 2) or one-hot encoding should be considered in sensitivity analyses. (ii) Caution (clinical/temporal): Response is often evaluated post-operatively (ATA dynamic risk stratification). If the modeling intent is pre-treatment risk estimation, including Response may constitute target leakage. We therefore treated Response as optional and excluded it from the primary model; it was only used in secondary, clearly demarcated analyses where temporality was appropriate. (iii) Two categorical variables—Thyroid Function and Physical Examination—were encoded using one-hot encoding prior to model development.
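The encoding scheme in Tables 2–4 amounts to a set of lookup tables, which can be sketched in Python as below. This is an illustration rather than the authors' MATLAB preprocessing: `encode_record` is a hypothetical helper, the example patient is made up, and the one-hot columns for Thyroid Function and Physical Examination are omitted because their category lists are not given in the tables.

```python
# Lookup tables transcribed from Tables 2-4 (original value -> encoded value).
ENCODING = {
    "T": {"T1a": 0, "T1b": 1, "T2": 2, "T3a": 3, "T3b": 4, "T4a": 5, "T4b": 6},
    "N": {"N0": 0, "N1a": 1, "N1b": 2},
    "M": {"M0": 0, "M1": 1},
    "Gender": {"F": 0, "M": 1},
    "Hx Smoking": {"No": 0, "Yes": 1},
    "Hx Radiotherapy": {"No": 0, "Yes": 1},
    "Focality": {"Multi-Focal": 0, "Uni-Focal": 1},
    # Left and Right share code 1, as stated in note (i).
    "Adenopathy": {"No": 0, "Left": 1, "Right": 1, "Posterior": 2,
                   "Bilateral": 3, "Extensive": 4},
    "Pathology": {"Micropapillary": 0, "Papillary": 1, "Follicular": 2,
                  "Hurthel cell": 3},
    "Response": {"Excellent": 0, "Indeterminate": 1,
                 "Biochemical Incomplete": 2, "Structural Incomplete": 3},
}

def encode_record(record):
    """Encode one patient record (hypothetical helper, not from the source).

    Age is passed through as a number; an unknown feature or category
    raises KeyError instead of being silently guessed.
    """
    encoded = {}
    for feature, value in record.items():
        if feature == "Age":
            encoded[feature] = float(value)
        else:
            encoded[feature] = ENCODING[feature][value]
    return encoded

# Made-up example patient restricted to the six Scenario II predictors.
example = encode_record({"Age": 52, "T": "T3a", "N": "N1b", "M": "M0",
                         "Hx Radiotherapy": "No", "Response": "Indeterminate"})
```

Dropping the `Response` key from the record gives the Scenario III (preoperative) input without any other change.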
Table 5. Summary of Statistical Analysis Results (Ordinal Logistic Regression for Risk Stratification).
| Predictor | χ² (LRT) | p-Value | Significance | Direction/Effect | Odds Ratio (OR) | Interpretation |
|---|---|---|---|---|---|---|
| Age | 6.0 | 0.014 | Significant | ↑ Age → ↑ Risk | 1.03 | Older patients show slightly higher risk. |
| Gender | 1.0 | 0.30 | ns | | 1.65 | No significant gender effect. |
| Hx Smoking | 1.7 | 0.20 | ns | | 0.31 | Smoking history not predictive. |
| Hx Radiotherapy | 4.8 | 0.027 | Significant | | 76–96 | Previous radiation exposure strongly associated with higher risk. |
| Adenopathy | 1.0 | 0.59 | ns | | 0.9–2.0 | No significant lymph-node enlargement effect after adjusting. |
| Pathology | 4.0 | 0.26 | ns | | | Histological subtype not statistically predictive. |
| Focality | 0.12 | 0.73 | ns | | 1.16 | Unifocality not associated with higher risk. |
| T-stage | 74–79 | <0.001 | Highly significant | | 7.7–4634 | Strongest predictor of higher risk; merged T levels improved stability. |
| N-stage | 36–84 | <0.001 | Highly significant | | 25–105 | Increasing lymph-node involvement markedly elevates risk. |
| M-stage | 18–20 | <0.001 | Highly significant | | 90–113 | Distant metastasis is a strong risk escalator. |
| Response | 7.9–8.3 | 0.04 | Significant | | 2.7–5.7 | Poor response to initial therapy predicts higher risk. |
| Model fit (overall) | | <0.001 | | χ² = 451, McFadden R² = 0.70, N = 383 | | |
| Thresholds (Low→Int/Int→High) | | <0.001 | | β1 = 5.7; β2 = 13.5 | | Distinct separation between ordered risk levels. |
Note: ns = not significant (p > 0.05). All tests are based on ordinal logistic regression with Risk (Low–Intermediate–High) as the dependent variable. The final model was selected after collapsing T4a + T4b → T4 and T1 + T2 → T1_2 to reduce category imbalance.
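To make the thresholds and odds ratios in Table 5 concrete, the sketch below shows how a cumulative-logit (proportional-odds) model turns a linear predictor into probabilities for the three risk levels. It rests on assumptions not stated in the table: that β1 and β2 are cutpoints on the logit scale, that the sign convention is P(Y ≤ k) = σ(θ_k − η), and that the Age odds ratio is per one-year increase. The function names are illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ordinal_probs(eta, thresholds):
    """Class probabilities under an assumed cumulative-logit model.

    P(Y <= k) = sigmoid(theta_k - eta); per-class probabilities are the
    differences between consecutive cumulative probabilities.
    """
    cumulative = [sigmoid(theta - eta) for theta in thresholds] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs


def odds_ratio_over(or_per_unit, units):
    """Odds ratio accumulated over `units` steps of a per-unit odds ratio."""
    return or_per_unit ** units


THRESHOLDS = (5.7, 13.5)  # beta1, beta2 as reported in Table 5

# Hypothetical linear predictors: well below, between, and above the cutpoints.
low, mid, high = (ordinal_probs(eta, THRESHOLDS) for eta in (0.0, 9.6, 18.0))

# Under the per-year assumption, OR = 1.03 compounds to about 1.34 per decade.
decade_or = odds_ratio_over(1.03, 10)
```

Each returned list is ordered (Low, Intermediate, High) and sums to 1. The wide gap between the two cutpoints means a linear predictor has to move by several logits before the most likely class changes, which is one way to read the "distinct separation" noted in the table.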
Table 6. Comparison of feature importance scores between ReliefF and MRMR algorithms.
Rank | Feature | ReliefF Score | MRMR Score | Dominant Algorithm Ranking | Interpretation
1 | T | 0.0941 | 0.3411 | Both Top 1 | Tumor size/extent is consistently the most influential variable.
2 | M | 0.0505 | 0.1061 | Both Top 5 | Presence of distant metastasis markedly increases risk.
3 | Response | 0.0503 | 0.1765 | MRMR | Captures postoperative biological behavior reflecting disease aggressiveness.
4 | N | 0.0485 | 0.1765 | MRMR | Regional lymph-node involvement is a strong driver of higher-risk classification.
5 | Age | 0.0433 | 0.0313 | ReliefF | Older age contributes modestly to elevated risk levels.
6 | Adenopathy | 0.0282 | 0.1106 | MRMR | Clinical lymph-node enlargement adds secondary discriminatory power.
7 | Pathology | 0.0212 | 0.0703 | MRMR | Histologic subtype has moderate association with risk.
8 | Gender | 0.0116 | 0.0913 | MRMR | Minor contribution; male sex slightly increases risk.
9 | Hx Radiotherapy | 0.0002 | 0.2445 | MRMR | Prior radiation exposure highly relevant when redundancy minimized.
10 | Focality | −0.0062 | 0.1730 | MRMR | Multifocal tumors moderately associated with higher risk after redundancy control.
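Because the two algorithms score on different scales, comparing ranks is more informative than comparing raw scores. The sketch below ranks the features separately by each score column of Table 6 and reports the rank pair per feature. The scores are copied from the table; the helper names `ranking` and `rank_pairs` are assumptions for illustration, and ties (Response and N share an MRMR score) keep table order.

```python
# (ReliefF score, MRMR score) per feature, copied from Table 6.
SCORES = {
    "T": (0.0941, 0.3411),
    "M": (0.0505, 0.1061),
    "Response": (0.0503, 0.1765),
    "N": (0.0485, 0.1765),
    "Age": (0.0433, 0.0313),
    "Adenopathy": (0.0282, 0.1106),
    "Pathology": (0.0212, 0.0703),
    "Gender": (0.0116, 0.0913),
    "Hx Radiotherapy": (0.0002, 0.2445),
    "Focality": (-0.0062, 0.1730),
}

RELIEFF, MRMR = 0, 1  # column indices into the score tuples


def ranking(scores, column):
    """Feature names sorted by descending score in the given column.

    sorted() is stable, so tied features keep their table order.
    """
    return sorted(scores, key=lambda name: scores[name][column], reverse=True)


def rank_pairs(scores):
    """Map each feature to its 1-based (ReliefF rank, MRMR rank)."""
    by_relieff = ranking(scores, RELIEFF)
    by_mrmr = ranking(scores, MRMR)
    return {name: (by_relieff.index(name) + 1, by_mrmr.index(name) + 1)
            for name in scores}


pairs = rank_pairs(SCORES)
# Features whose rank differs by four or more places between the two methods.
disputed = sorted(name for name, (a, b) in pairs.items() if abs(a - b) >= 4)
```

Read this way, T is ranked first by both methods, while Hx Radiotherapy moves from ninth under ReliefF to second under MRMR, which matches the table's remark that it becomes relevant once redundancy is minimized. The four-place cut-off used for `disputed` is an arbitrary illustrative choice.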
Table 7. Combined Feature Selection Results from Statistical and Machine Learning Analyses.
Feature | Evidence from Statistical Analysis (OLR) | Evidence from Machine Learning (ReliefF/MRMR) | Final Inclusion Decision | Justification
T | Highly significant (χ² ≈ 74–79, p < 0.001) | Top rank (ReliefF ≈ 0.094; MRMR ≈ 0.341) | Include | Strongest, consistent driver of higher risk.
N | Highly significant (χ² ≈ 37–84, p < 0.001) | High (MRMR ≈ 0.177) | Include | Regional spread markedly elevates risk.
M | Highly significant (χ² ≈ 18–20, p < 0.001) | High (ReliefF ≈ 0.051; MRMR ≈ 0.106) | Include | Distant metastasis strongly linked to high risk.
Age | Significant (p ≈ 0.014–0.020) | Moderate (ReliefF ≈ 0.043; MRMR ≈ 0.031) | Include | Older age slightly increases risk level.
Hx Radiotherapy | Significant (p ≈ 0.027–0.031) | High in MRMR (≈ 0.245), low in ReliefF (≈ 0.0002) | Include | Clinically meaningful; aligns with high-risk referral patterns.
Response | Significant (p ≈ 0.03–0.04), graded ORs | High (MRMR ≈ 0.177; ReliefF ≈ 0.050) | Include | Captures postoperative biological behavior; jointly analyzed with Risk per study aim.
Adenopathy | Not significant (p ≈ 0.59–0.62) | Moderate (MRMR ≈ 0.111) | Optional | Adds secondary discrimination in ML ranking.
Pathology | Not significant (p ≈ 0.26–0.28) | Moderate (MRMR ≈ 0.070) | Optional | Histologic subtype provides supplementary signal.
Gender | Not significant (p ≈ 0.30) | Low–moderate (MRMR ≈ 0.091) | Optional | Small effect; keep for completeness.
Table 8. Performance comparison of machine-learning models under three feature-selection scenarios.
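Features | Model Type | Accuracy % (Validation) | Accuracy % (Test) | Precision % (Test) | Recall % (Test) | F1 Score % (Test) | AUC
Scenario I | Tree | 88.99 | 86.84 | 88.96 | 81.69 | 83.77 | 0.9469
Scenario I | Discriminant | 85.80 | 86.84 | 89.03 | 71.97 | 74.54 | 0.9559
Scenario I | Efficient Logistic Regression | 88.12 | 86.84 | 88.89 | 80.05 | 83.31 | 0.9615
Scenario I | Naive Bayes | 87.83 | 81.58 | 85.06 | 64.27 | 68.95 | 0.7985
Scenario I | SVM | 87.83 | 78.95 | 73.04 | 74.24 | 73.43 | 0.8624
Scenario I | Efficient Linear | 88.12 | 86.84 | 88.89 | 80.05 | 83.31 | 0.9615
Scenario I | KNN | 85.80 | 76.32 | 81.47 | 61.49 | 65.88 | 0.8560
Scenario I | Kernel | 86.67 | 86.84 | 89.94 | 80.05 | 83.87 | 0.9466
Scenario I | Neural Network | 88.41 | 89.47 | 90.86 | 83.08 | 85.65 | 0.9661
Scenario II | Tree | 86.09 | 89.47 | 91.11 | 84.72 | 85.98 | 0.9186
Scenario II | Discriminant | 85.80 | 86.84 | 89.03 | 71.97 | 74.54 | 0.9142
Scenario II | Efficient Logistic Regression | 88.41 | 89.47 | 90.86 | 83.08 | 85.65 | 0.9556
Scenario II | Naive Bayes | 86.09 | 86.84 | 89.49 | 78.41 | 82.73 | 0.9322
Scenario II | SVM | 89.57 | 84.21 | 86.85 | 78.66 | 81.45 | 0.9076
Scenario II | Efficient Linear | 88.41 | 89.47 | 90.86 | 83.08 | 85.65 | 0.9556
Scenario II | KNN | 86.67 | 81.58 | 85.06 | 77.27 | 79.65 | 0.9001
Scenario II | Kernel | 89.86 | 89.47 | 91.11 | 84.72 | 85.98 | 0.9054
Scenario II | Neural Network | 88.41 | 92.11 | 92.86 | 86.11 | 87.88 | 0.9045
Scenario III | Tree | 88.12 | 92.11 | 92.86 | 86.11 | 87.88 | 0.9302
Scenario III | Discriminant | 86.38 | 86.84 | 89.03 | 71.97 | 74.54 | 0.8731
Scenario III | Efficient Logistic Regression | 86.96 | 89.47 | 90.86 | 83.08 | 85.65 | 0.9734
Scenario III | Naive Bayes | 84.06 | 84.21 | 86.97 | 68.94 | 72.22 | 0.7404
Scenario III | SVM | 90.14 | 89.47 | 91.11 | 84.72 | 85.98 | 0.8948
Scenario III | Efficient Linear | 86.96 | 89.47 | 90.86 | 83.08 | 85.65 | 0.9734
Scenario III | KNN | 87.25 | 81.58 | 84.72 | 75.63 | 79.02 | 0.8625
Scenario III | Kernel | 89.57 | 89.47 | 90.86 | 83.08 | 85.65 | 0.9007
Scenario III | Neural Network | 89.28 | 89.47 | 91.11 | 84.72 | 85.98 | 0.9331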
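The test-set columns of Table 8 can be related to a three-class confusion matrix as sketched below. The table does not state how precision, recall, and F1 were averaged across the Low, Intermediate, and High classes; macro-averaging (the unweighted mean of per-class values) is assumed here. The confusion matrix is hypothetical and chosen only so that it contains 38 samples with 34 correct, since the reported test accuracies are all consistent with multiples of 1/38 (for example 34/38 ≈ 89.47%), which suggests but does not confirm a 38-sample test set.

```python
def macro_metrics(confusion):
    """Accuracy plus macro-averaged precision, recall, and F1.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j. Macro-averaging is an assumption here.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n_classes))
    precisions, recalls, f1_scores = [], [], []
    for k in range(n_classes):
        true_pos = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n_classes))
        actual_k = sum(confusion[k])
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        # Macro F1 is the mean of per-class F1 values, not the harmonic
        # mean of the macro precision and macro recall.
        "f1": sum(f1_scores) / n_classes,
    }


# Hypothetical confusion matrix (rows: true Low, Intermediate, High).
HYPOTHETICAL_CONFUSION = [
    [24, 1, 0],
    [2, 8, 0],
    [0, 1, 2],
]

metrics = macro_metrics(HYPOTHETICAL_CONFUSION)
```

With a small minority class, a single misclassified High-risk case moves macro recall far more than it moves accuracy, which may help explain why recall varies much more than accuracy across the models in Table 8.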
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Khwanmuang, M.; Cholamjiak, W.; Sukson, P. Explaining Risk Stratification in Differentiated Thyroid Cancer Using SHAP and Machine Learning Approaches. Biomedicines 2025, 13, 2964. https://doi.org/10.3390/biomedicines13122964

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
