1. Introduction
Differentiated thyroid cancer (DTC), comprising papillary (PTC) and follicular (FTC) carcinoma, accounts for the vast majority of thyroid malignancies worldwide [1, 2, 3]. Although overall disease-specific survival is excellent, a clinically meaningful subset of patients develops persistent or recurrent disease during follow-up, with reported 10-year recurrence proportions ranging from ~15% to nearly 30% depending on cohort composition, disease stage, and surveillance intensity [1, 4, 5, 6]. PTC is the predominant subtype; FTC represents ~10–15% of cases and is distinguished by a higher propensity for hematogenous spread due to vascular invasion, which underpins its metastatic behavior to lung and bone [2, 7, 8].
Thyroid biology provides additional context for risk behavior. The gland synthesizes thyroxine (T4) and triiodothyronine (T3) under the control of thyroid-stimulating hormone (TSH). While TSH stimulation has long been implicated in thyroid tumorigenesis, emerging data suggest that higher circulating free T4 (FT4), altered FT4/FT3 ratios, and hyperthyroxinemia may also be associated with malignant transformation and adverse phenotypes in DTC [9, 10, 11]. These endocrine signals intersect with the established clinicopathologic determinants of recurrence: tumor size and extrathyroidal extension, lymph-node metastasis, distant spread, histologic subtype, vascular/lymphovascular invasion, and multifocality [5, 7, 12, 13, 14, 15]. Therapeutic strategies differ markedly between well-differentiated and dedifferentiated or otherwise aggressive thyroid cancer subtypes. While management of DTC typically relies on surgery followed by radioiodine therapy, dedifferentiated tumors, characterized by loss of iodine avidity, often require systemic molecular-targeted agents, particularly tyrosine kinase inhibitors. These distinctions highlight the clinical importance of accurately identifying key prognostic predictors in DTC [16]. Furthermore, because a subset of patients with DTC develops disease recurrence, early recognition of recurrence-associated risk factors remains essential. Recent evidence [17] indicates that intraglandular dissemination, tumor size, bilateral cervical lymph node involvement, and coexisting Hashimoto’s thyroiditis represent major risk determinants, underscoring the need for more refined and individualized risk-stratification approaches.
Two complementary frameworks anchor contemporary practice. The AJCC TNM system stratifies anatomic extent to inform prognosis and survival counseling [18], while the American Thyroid Association (ATA) Risk Stratification System integrates postoperative clinicopathologic variables to classify recurrence risk as low, intermediate, or high and to guide decisions regarding radioactive iodine (RAI), thyrotropin suppression, and follow-up intensity [5, 19]. Dynamic risk stratification refines these categories over time by incorporating treatment response (biochemical and structural) during surveillance [20, 21].
Despite their clinical utility, TNM and ATA risk frameworks leave important needs unmet:
False-negative risk assignment. Even patients initially categorized as ATA low risk can recur during long-term follow-up; intermediate-risk misclassification remains non-trivial in several series, reflecting underestimation of true recurrence probability [4, 6, 20].
Observer variability. Key histologic features, especially vascular/lymphovascular invasion and minimal extrathyroidal extension, show modest interobserver agreement, with discordance reported in multi-institutional studies, contributing to inconsistent risk assignment [13, 14, 15, 22].
Incomplete documentation. Missing or inconsistent ATA reporting, variable quantification of nodal burden, and heterogeneous capture of extrathyroidal extension complicate cross-institutional comparisons and registry-based research [3, 5].
Static, siloed data. Classical schemes rely on static postoperative pathology and rarely integrate longitudinal labs, imaging trajectories, and narrative reports in a unified, quantitative manner [20, 23].
Limited interpretability of newer tools. While AI models can outperform traditional schemes on discrimination metrics, their “black-box” nature and lack of transparent, case-level explanations hinder clinical adoption [24, 25, 26].
Recent studies have applied machine learning (ML), including support vector machines, random forests, gradient boosting, decision trees, artificial neural networks (ANNs), and CatBoost, to improve recurrence prediction and dynamic risk reassessment, achieving high sensitivities and AUROC values in single-center cohorts [24, 27, 28, 29, 30, 31, 32]. Typical predictors include age and sex; TNM components; multifocality; intraglandular dissemination; Hashimoto’s thyroiditis; smoking and prior radiotherapy; and early treatment response [24, 28, 29, 30, 31]. However, dataset heterogeneity, site-specific practice patterns, and limited generalizability remain concerns; crucially, clinicians require transparent explanations that map model outputs back to familiar risk factors.
SHAP (SHapley Additive exPlanations) offers a principled, model-agnostic solution by decomposing a prediction into additive contributions from each feature based on cooperative game theory [33, 34]. SHAP can (i) rank global drivers of risk and (ii) provide patient-level attributions that reveal how, for example, nodal status, tumor extent, and treatment response jointly alter predicted risk. This is especially relevant because risk classification itself is not directly measurable: it depends on expert synthesis of multiple variables (capsular and vascular invasion, extrathyroidal extension, multifocality, and histologic subtype) whose accurate assessment often requires highly experienced pathologists. Borderline cases (e.g., minimal extrathyroidal extension or focal vascular invasion) are susceptible to inter-observer variability, introducing latent label noise that may degrade model calibration if unaddressed [13, 14, 15, 22]. By learning reproducible multi-feature patterns and pairing them with SHAP explanations, ML models can complement clinician judgment and mitigate subjective variability in risk assessment.
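The additive decomposition described above can be illustrated with a minimal sketch that computes exact Shapley values by enumerating feature coalitions. The toy risk score, its coefficients, and the baseline and patient values are hypothetical and are not the model developed in this study; practical SHAP implementations approximate this enumeration rather than performing it exhaustively.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy inputs (illustration only, not study data).
NAMES = ["N", "T", "Response"]
BASELINE = {"N": 0.0, "T": 0.0, "Response": 0.0}
PATIENT = {"N": 1.0, "T": 1.0, "Response": 1.0}


def toy_model(x):
    """Hypothetical risk score with an N x T interaction term."""
    return (0.10 + 0.30 * x["N"] + 0.20 * x["T"]
            + 0.10 * x["Response"] + 0.15 * x["N"] * x["T"])


def value_fn(coalition):
    # Features in the coalition take the patient's value; the rest stay at baseline.
    x = {name: (PATIENT[name] if idx in coalition else BASELINE[name])
         for idx, name in enumerate(NAMES)}
    return toy_model(x)


phi = shapley_values(value_fn, len(NAMES))
for name, contribution in zip(NAMES, phi):
    print(f"{name}: {contribution:+.3f}")
print("sum of contributions:", round(sum(phi), 3))
print("prediction minus baseline:", round(toy_model(PATIENT) - toy_model(BASELINE), 3))
```

In this toy case the contributions sum to the difference between the patient's score and the baseline score, and the interaction term is shared equally between N and T, which is the additivity property that makes patient-level attributions readable.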
To address these gaps, we develop and explain ML models for risk stratification in differentiated thyroid cancer using a curated set of clinicopathologic features (including treatment response when appropriate) and quantify feature contributions with SHAP. We (i) compare feature-selection strategies, (ii) evaluate multiple classifiers against standard metrics, and (iii) use SHAP to align model behavior with clinical expectations around TNM burden, response, and vascular invasion, thereby delivering an interpretable, clinically consonant framework for recurrence risk assessment.
3. Results
3.1. Results of Statistical Analysis
The statistical analysis was conducted using ordinal logistic regression (OLR) to identify significant predictors of risk stratification in differentiated thyroid cancer. The dependent variable was Risk (ordered as Low → Intermediate → High). The full model incorporated all clinicopathological variables, and subsequent adjustments were performed by merging certain T subcategories to address class imbalance among tumor stages.
From Table 5, the omnibus likelihood ratio tests (LRT) identified T-stage, N-stage, and M-stage as the most influential factors associated with increasing Risk levels (p < 0.001 for all three). Response to initial therapy (p ≈ 0.04), Hx Radiotherapy (p ≈ 0.027), and Age (p ≈ 0.014) also contributed significantly. Other variables, including Gender, Smoking history, Focality, Pathology, and Adenopathy, were not statistically significant (p > 0.05). The ranking of effect importance based on χ² values indicated a clear hierarchy: T > N > M > Response > Hx Radiotherapy > Age. Initial analysis revealed that T-stage contained multiple sublevels (T1a–T4b), with some categories having very small sample sizes, leading to unstable odds ratio (OR) estimates. Therefore, clinically related subgroups were merged (e.g., T4a + T4b → T4, and T1 + T2 → T1_2). After adjustment, the model maintained comparable fit (χ² = 447, p < 0.001; McFadden R² = 0.695), confirming the robustness of the predictors.
After adjusting for covariates:
Age (OR = 1.03, p ≈ 0.02) modestly increased the likelihood of a higher risk category.
Hx Radiotherapy had a strong positive association with higher risk (OR ≈ 76–96, p ≈ 0.01–0.03).
Metastasis (M1) markedly increased risk (OR ≈ 90–113, p < 0.001).
N1b nodes conferred the greatest effect (OR ≈ 86–105, p < 0.001), while N1a also remained significant (OR ≈ 25–33, p < 0.001).
Response categories also showed graded associations: compared with Excellent, Structural Incomplete (OR ≈ 3.9, p ≈ 0.03) and Biochemical Incomplete (OR ≈ 4.8–5.7, p ≈ 0.02–0.03) were linked with elevated risk.
Non-significant variables included Gender, Smoking, Adenopathy, Pathology, and Focality (all p > 0.1).
The estimated thresholds separating Low → Intermediate (β ≈ 5.7, p < 0.001) and Intermediate → High (β ≈ 13.5, p < 0.001) were statistically significant, demonstrating distinct boundaries between risk categories.
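To make the role of the two thresholds concrete, the following sketch converts a linear predictor into class probabilities under a proportional-odds model. It assumes the parameterization P(Y ≤ j) = logistic(θ_j − η), which is one common convention; the sign convention used by the software in this study is not stated here, and the example values of η are hypothetical. Only the approximate thresholds (5.7 and 13.5) are taken from the text.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def ordinal_probs(eta, thresholds):
    """Class probabilities for an ordered outcome.

    Assumes P(Y <= j) = logistic(theta_j - eta), so a larger linear
    predictor eta shifts probability toward the higher categories.
    """
    cumulative = [logistic(theta - eta) for theta in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)  # P(Y = j) = P(Y <= j) - P(Y <= j - 1)
        previous = c
    return probs


THRESHOLDS = [5.7, 13.5]  # approximate values reported in the text
LABELS = ["Low", "Intermediate", "High"]

# Hypothetical linear predictors: below, between, and above the thresholds.
for eta in (0.0, 9.6, 15.0):
    p = ordinal_probs(eta, THRESHOLDS)
    print(eta, {label: round(value, 3) for label, value in zip(LABELS, p)})

# On the log-odds scale, an odds ratio of 1.03 corresponds to a coefficient of ln(1.03).
print("coefficient for OR = 1.03:", round(math.log(1.03), 4))
```

Under this convention, a patient whose linear predictor lies well below the first threshold is assigned almost entirely to Low, one between the thresholds to Intermediate, and one above the second threshold to High.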
Clinically, the dominance of T, N, and M as predictors reflects the underlying TNM staging logic, which inherently defines tumor progression and spread. However, the ordinal regression analysis quantifies their relative contribution and confirms that T-stage exerts the strongest incremental effect on risk escalation. The strong effect of Hx Radiotherapy likely reflects referral bias in higher-risk patients previously exposed to radiation. The Response variable captures postoperative biological behavior and indicates that biological aggressiveness after surgery remains a strong determinant of risk classification.
3.2. Feature Selection Using Machine Learning Approaches
To enhance model interpretability and generalizability, feature selection, predictive modeling, and model explainability were performed using machine learning workflows implemented in MATLAB’s Classification Learner, including SHAP-based analysis to quantify feature contribution. While statistical analysis (Section 3.1) identified significant variables through inferential testing, machine learning (ML)-based feature selection enables multivariate and non-linear evaluation of variable importance, which is particularly useful for complex clinical data such as thyroid cancer risk stratification.
This dual approach ensures that both statistical significance and predictive relevance are systematically assessed before model development.
Two supervised ML algorithms were employed to rank and select influential predictors:
ReliefF algorithm [36]: a neighborhood-based feature ranking technique that evaluates each variable’s ability to differentiate between neighboring instances belonging to different risk categories. It captures non-linear and interaction effects among variables without assuming any specific data distribution.
MRMR (Minimum Redundancy Maximum Relevance) [37]: an information-theoretic algorithm that identifies features exhibiting the highest relevance to the target variable (Risk) while minimizing redundancy among correlated predictors. This approach enhances model efficiency and reduces collinearity between features such as T, N, and M.
Both algorithms were applied to the encoded dataset (excluding Stage due to redundancy with TNM variables). The computed importance scores from ReliefF and MRMR were compared to identify consistent patterns and complementary insights. The visualized feature importance profiles and the resulting rankings are presented in Figure 2 and Table 6, respectively.
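The relevance-versus-redundancy trade-off behind MRMR can be sketched as a greedy ranking over discrete features. This minimal version scores each candidate as its mutual information with the target minus its mean mutual information with the features already selected (the difference form); a quotient form is a common alternative, and which variant the study's software applies is not stated here. The six-row dataset and the `T_duplicate` column are hypothetical and serve only to show a redundant feature being pushed down the ranking.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def mrmr_rank(features, target):
    """Greedy MRMR ranking; features maps name -> list of discrete values."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected, remaining = [], set(features)
    while remaining:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        # sorted() makes tie-breaking deterministic (alphabetical by name).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: risk depends strongly on T and weakly on N.
T = [0, 0, 1, 1, 2, 2]
N = [0, 1, 0, 1, 0, 1]
risk = [min(2, t + n) for t, n in zip(T, N)]
features = {"T": T, "N": N, "T_duplicate": list(T)}

ranked = mrmr_rank(features, risk)
print(ranked)
```

Although `T_duplicate` is as relevant to the target as T, it is ranked last because it adds no information once T has been selected, which mirrors the rationale for excluding Stage when T, N, and M are already present.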
Both machine learning algorithms identified T-stage as the most dominant predictor of thyroid cancer risk stratification. The ReliefF algorithm emphasized overall stage (T > Stage > M > Response > N > Age), focusing on neighborhood-based differentiation, whereas MRMR prioritized features with high information gain and minimal overlap (T > Hx Radiotherapy > N > Response > Focality). Although the two rankings order N and M differently, both place T first and retain nodal status among the leading predictors, which is broadly consistent with the TNM-based clinical hierarchy. They also highlight the influence of prior radiotherapy and postoperative response, which are examined further through SHAP-based explainability analyses in Section 3.4.
3.3. Machine Learning Model Development
After identifying the statistically and clinically relevant variables, machine learning (ML) model development was performed to predict the categorical risk levels (Low, Intermediate, High) of differentiated thyroid cancer. The primary goal of this phase was to construct a robust and interpretable predictive model that integrates both statistical inference and data-driven learning to improve risk stratification beyond conventional ordinal regression. Prior to training, feature selection was finalized by integrating results from the statistical analysis (Ordinal Logistic Regression) and ML-based feature ranking algorithms (ReliefF and MRMR). This hybrid selection strategy ensured that only the most informative, non-redundant, and clinically meaningful predictors were included. Stage was excluded because it is redundant with the TNM variables, and Response, which carries post-treatment information, was removed in the strictly preoperative scenario to avoid post-treatment leakage. The resulting feature set represents a compromise between model simplicity, interpretability, and predictive strength.
The final selected features used for model training are summarized in Table 7. These variables were used to train multiple supervised learning algorithms in subsequent steps, including logistic regression, random forest, and gradient boosting, with hyperparameter tuning and five-fold cross-validation to ensure model generalizability.
According to the integrated evidence from both statistical and machine learning analyses, three distinct feature selection scenarios were defined for subsequent model training:
Scenario I—Extended Feature Set (Include + Optional): This configuration expands upon the core set by including additional variables that demonstrated moderate importance or complementary predictive potential in machine learning ranking. The extended feature set includes T, N, M, Age, Hx Radiotherapy, Response, Adenopathy, Pathology, and Gender.
Scenario II—Core Feature Set (Include only): This configuration contains only the variables confirmed as statistically and clinically significant in both analytical approaches. The selected features are T, N, M, Age, Hx Radiotherapy, and Response. These features represent the essential predictors for baseline risk modeling and provide the foundation for developing the main predictive framework.
Scenario III—Reduced Preoperative Feature Set (Include—Response): This scenario excludes the postoperative Response variable to focus strictly on preoperative predictors. The remaining five features—T, N, M, Age, and Hx Radiotherapy—represent the key variables that retain strong statistical and machine-learning support. This configuration evaluates the model’s ability to perform true preoperative risk stratification using only baseline information.
To ensure methodological robustness and reduce overfitting risk, the dataset was divided into three parts: (1) a training set with 5-fold cross-validation for model learning, (2) a validation set for hyperparameter optimization, and (3) a hold-out independent test set comprising 10% of the data for unbiased performance evaluation.
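A partitioning scheme of this kind can be sketched as a stratified 10% hold-out followed by stratified five-fold cross-validation on the remainder. The class counts below are hypothetical, the separate validation subset is omitted because its size is not stated in the text, and the study itself used MATLAB's Classification Learner rather than this code.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.10, seed=0):
    """Split indices into train/test while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for y in sorted(by_class):
        idxs = by_class[y][:]
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_fraction))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


def stratified_kfold(indices, labels, k=5, seed=0):
    """Yield (train, validation) index lists for k stratified folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idxs = by_class[y][:]
        rng.shuffle(idxs)
        # Deal each class round-robin so every fold keeps the class mix.
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    for i in range(k):
        val = sorted(folds[i])
        trn = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield trn, val


# Hypothetical class counts (illustration only).
labels = ["Low"] * 200 + ["Intermediate"] * 100 + ["High"] * 30
train_idx, test_idx = stratified_holdout(labels, test_fraction=0.10, seed=42)
folds = list(stratified_kfold(train_idx, labels, k=5, seed=42))
print(len(train_idx), len(test_idx), [len(val) for _, val in folds])
```

Stratifying both steps matters for a three-class outcome with a small High-risk group, because an unstratified split could leave a fold with very few High-risk cases.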
From Table 8, reducing the feature set from ten to six variables (Scenario II) improved computational efficiency while maintaining comparable or enhanced predictive performance across most classifiers. The Neural Network demonstrated the highest performance in Scenarios I and II, with accuracy increasing from 89.47% to 92.11% and only a modest rise in macro F1 score. In Scenario III, where the postoperative Response variable was excluded, the Tree model emerged as the best-performing and most practically applicable classifier, achieving 88.12% accuracy, 92.11% precision, 92.86% recall, and 87.88% macro F1, alongside a strong AUC (~0.93). This indicates that the five strictly preoperative predictors (T, N, M, Age, and Hx Radiotherapy) retain sufficient discriminatory power for robust early risk stratification. The consistency and interpretability of the Tree model further support its suitability for real-world clinical deployment. Overall, the findings show that progressive feature reduction enhances parsimony and preserves predictive accuracy, with the Tree model offering the most effective balance of performance, simplicity, and clinical usability in the preoperative setting.
The optimizable neural network model (Figure 3) demonstrated strong classification performance, with minimal misclassification and high discriminative ability (AUCs: 0.85–0.96; micro-AUC: 0.94), effectively separating all three DTC risk groups. Similarly, the optimizable tree model in Scenario III (Figure 4) showed consistently high performance using only preoperative predictors, with class-wise AUCs of 0.93–0.96 and accurate identification across all classes. Together, these results confirm the robustness of both models, with the tree model offering a highly effective and clinically practical solution in the reduced preoperative feature setting.
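For readers who want to reproduce the summary metrics from a three-class confusion matrix, the sketch below computes accuracy and macro-averaged precision, recall, and F1. The matrix is hypothetical and is not taken from Figure 3 or Figure 4; it was chosen only so that 35 of 38 cases are correct, which is arithmetically consistent with an accuracy of 92.11%.

```python
def macro_metrics(confusion):
    """Accuracy and macro-averaged metrics.

    confusion[i][j] is the number of cases with true class i predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(k))  # column sum
        actual = sum(confusion[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": correct / total,
        "macro_precision": sum(precisions) / k,
        "macro_recall": sum(recalls) / k,
        "macro_f1": sum(f1s) / k,
    }


# Hypothetical confusion matrix (rows: true Low, Intermediate, High).
CONFUSION = [
    [24, 1, 0],
    [1, 8, 1],
    [0, 0, 3],
]

metrics = macro_metrics(CONFUSION)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Macro averaging weights each risk class equally, so a small High-risk class influences the macro F1 as much as the large Low-risk class; this is why accuracy and macro F1 can move by different amounts between scenarios.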
3.4. Model Explainability Using SHAP Analysis
After identifying the neural network as the best-performing model in Scenario II (Table 8), SHAP (SHapley Additive exPlanations) analysis was employed to interpret how individual predictors contributed to the model’s output probabilities across the three thyroid cancer risk categories (Low, Intermediate, and High). The SHAP framework provides a unified measure of feature contribution by quantifying each variable’s marginal impact on model predictions, thereby enhancing model transparency and clinical interpretability.
By computing the mean absolute Shapley values, we can determine which features exert the greatest influence on the neural network’s decision boundary. This approach allows clinicians and researchers to confirm whether the model’s learned relationships are consistent with established clinical reasoning and pathological patterns observed in differentiated thyroid cancer.
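The aggregation step can be sketched as follows: given one row of SHAP values per patient, the mean absolute value per feature yields the global ranking. The four rows below are hypothetical numbers chosen for illustration and do not come from Figure 5 or Figure 6; in particular, the relative order of M and Hx Radiotherapy is an assumption.

```python
FEATURES = ["N", "T", "Response", "Age", "M", "Hx Radiotherapy"]

# Hypothetical per-patient SHAP values for one output class (illustration only).
SHAP_ROWS = [
    [0.30, 0.20, 0.10, 0.02, 0.00, 0.00],
    [-0.25, -0.15, -0.05, 0.01, 0.00, 0.00],
    [0.35, 0.10, 0.15, -0.03, 0.05, 0.01],
    [-0.20, -0.25, -0.10, 0.02, 0.00, -0.01],
]


def global_importance(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    n = len(shap_rows)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda item: item[1], reverse=True)


ranking = global_importance(SHAP_ROWS, FEATURES)
for name, value in ranking:
    print(f"{name}: {value:.4f}")

# The signed mean hides importance because positive and negative effects cancel.
signed_mean_n = sum(row[0] for row in SHAP_ROWS) / len(SHAP_ROWS)
print("signed mean for N:", round(signed_mean_n, 4))
```

Taking absolute values before averaging is the design choice that matters here: a feature that pushes some patients toward higher risk and others toward lower risk would otherwise appear unimportant.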
Figure 5 demonstrates that the neural network model in Scenario II relied primarily on N and T, with N showing the highest Shapley values and representing the dominant driver of prediction, followed by substantial contributions from T and the postoperative Response variable. Age had a moderate effect, while M and Hx Radiotherapy added minimal incremental value once nodal and local tumor burden were accounted for.
Figure 6 shows a similar pattern in Scenario III (excluding Response), where the Tree model again identified N as the strongest predictor, followed by T and Age, with M and Hx Radiotherapy contributing only marginally. This consistency across models reinforces the central role of nodal and tumor extent parameters in preoperative risk discrimination.
The class-wise SHAP plots from the neural network (Figure 7) and tree model (Figure 8) consistently reveal clear and clinically coherent patterns of feature influence across the three risk classes.
Low-risk predictions are characterized by low T and N values with small or negative SHAP contributions, indicating minimal tumor extension and absence of nodal metastasis. Response and Age show minimal influence, aligning with the favorable biologic profile of this group.
Intermediate-risk cases exhibit moderate positive SHAP shifts primarily from N and T, showing that even limited regional spread or modest tumor growth increases risk probability. Response provides additional separation—non-excellent responses push predictions upward, while excellent responses shift them downward.
High-risk predictions are dominated by high N and T values with strong positive SHAP effects, reflecting aggressive tumor behavior and regional metastasis. Response again plays a major role, with non-excellent outcomes yielding the highest SHAP values. M contributes intermittently in line with metastatic potential, while Hx Radiotherapy produces minimal impact in both models.
Across both modeling approaches, T and N emerge as the primary determinants of class separation, with Response providing dynamic post-treatment refinement and Age and M contributing modestly. These SHAP patterns affirm that the models differentiate risk categories in a clinically meaningful and guideline-consistent manner.
From Figure 7, Figure 8, Figure 9, and Figure 10, the class-wise SHAP box summary plots consistently demonstrate distinctive feature influence patterns across thyroid cancer risk levels in both the neural network and tree models.
Low risk. SHAP values for most predictors are centered near zero with minimal spread, indicating limited contribution to risk elevation. Small variation in T and N confirms the generally indolent phenotype: small tumors without lymph node involvement. Low and stable SHAP values for Response further reflect favorable postoperative outcomes.
Intermediate risk. Wider interquartile ranges for N and T highlight their stronger and more variable impact in this transitional group. N remains the dominant driver, while T provides an additional local invasion signal. Moderate contributions from Response and Age suggest their roles in refining borderline risk assignments, especially in distinguishing patients with good treatment response from those with suboptimal response.
High risk. Positive SHAP distributions for N and T are the most pronounced, confirming extensive nodal burden and aggressive tumor growth as key determinants of high-risk status. Response exhibits consistently positive contributions, indicating that incomplete or poor treatment response strongly increases recurrence probability. Occasional influence from M aligns with advanced metastatic cases. Hx Radiotherapy remains negligible, supporting its role as a secondary indicator in both models.
Across all visualizations, T and N are the most decisive predictors of risk escalation, while Response acts as an important behavioral marker post-treatment. These class-specific SHAP patterns reinforce the clinical reliability and interpretability of the proposed models in stratifying differentiated thyroid cancer risk.
4. Discussion
This study developed an interpretable machine-learning framework for risk stratification in differentiated thyroid cancer (DTC), integrating clinical, pathological, and postoperative response variables. By combining neural network modeling with SHAP-based explainability, our approach provides both predictive accuracy and clinical interpretability—bridging a critical gap between statistical prediction and individualized patient management.
The model identified six key predictors (T, N, Response, Age, M, and Hx Radiotherapy) as the most influential determinants of recurrence risk. Among these, N and T exhibited the strongest positive SHAP contributions to high-risk classification, consistent with their established roles in tumor burden and local invasion. Cervical lymph node metastasis (N1b) remains the most important prognostic determinant in papillary thyroid carcinoma, directly influencing the extent of surgery, need for adjuvant radioactive iodine (RAI), and intensity of postoperative surveillance. Similarly, T classification reflects extrathyroidal extension and tumor size, aligning closely with the ATA and AJCC risk frameworks. Interestingly, the inclusion of Response as a dynamic postoperative variable substantially enhanced model precision. In clinical practice, treatment response integrates both biochemical (e.g., thyroglobulin trends) and structural evidence from imaging; thus, it serves as a real-time reflection of disease behavior beyond static pathology. The strong model contribution of Response indicates that incorporating longitudinal follow-up data can recalibrate initial risk assessment, echoing the principles of dynamic risk stratification advocated by Tuttle et al. [19]. The finding that Hx Radiotherapy retained positive SHAP influence for higher-risk patients likely reflects referral bias: individuals previously exposed to head-and-neck radiation are more frequently managed in tertiary centers and often present with more aggressive disease phenotypes. This observation emphasizes that background treatment history remains an essential modifier of risk interpretation.
The “Risk” labels used in this study are derived from the evidence-supported 2015 and 2025 ATA Risk of Recurrence stratification systems for differentiated thyroid carcinoma (DTC), rather than subjective expert judgment. These ATA frameworks integrate objectively validated clinicopathologic and molecular predictors, including tumor focality, extrathyroidal extension, vascular invasion (angioinvasion), lymph node burden (size, number, and lymph node ratio, LNR), AJCC staging, and treatment response variables. Large-scale cohort studies and meta-analyses have consistently demonstrated the predictive validity of these risk categories; for example, the 2015 ATA system documented recurrence rates of approximately 1.5% (low risk), 5.4% (intermediate risk), and 25% (high risk), with similar trends observed in T1a PTC and other DTC subtypes. These findings reinforce that the risk strata employed in this work reflect clinically validated constructs rather than arbitrary or subjective classification.
Nevertheless, because ATA-based categories represent proxy indicators of recurrence rather than longitudinally observed outcomes, some degree of latent label noise may still be present, particularly in retrospective datasets. To mitigate this, the present study employed rigorous internal validation, including five-fold cross-validation, a dedicated validation subset, and a 10% independent hold-out test set to enhance generalization and minimize overfitting. Future work incorporating datasets with real longitudinal recurrence outcomes will further improve calibration and strengthen clinical applicability.
Furthermore, in Scenario III—where the postoperative Response variable was removed to reflect a strictly preoperative prediction setting—the Optimizable Tree model continued to demonstrate excellent performance, with accuracy and AUC values comparable to those observed in Scenario II. This finding indicates that robust recurrence-risk prediction in differentiated thyroid cancer can still be achieved without relying on postoperative data. Importantly, by using only preoperative clinical features (T, N, M, Age, and Hx Radiotherapy), the model enables actionable risk estimation before initial treatment decisions are made, supporting better surgical planning, early evaluation of adjuvant therapy needs, and individualized patient counseling. These results reinforce the model’s clinical utility and align with the goals of precision oncology, promoting proactive and data-driven personalized care even in the absence of longitudinal follow-up information.
From a clinical perspective, the explainable machine-learning framework complements traditional staging systems by offering individualized probability estimates rather than categorical labels. For example, patients with intermediate ATA risk but high SHAP-derived contributions from Response or N may warrant closer monitoring or early RAI ablation, even if they fall within a conservative management protocol. Conversely, those with low SHAP scores across all predictors may safely undergo de-escalated surveillance, thereby reducing unnecessary imaging and cost. The interpretability of SHAP outputs provides a major advantage for multidisciplinary tumor boards, enabling endocrinologists, surgeons, and nuclear medicine specialists to visualize how each factor contributes to recurrence probability. This transparency enhances clinical trust and supports shared decision-making, aligning computational outputs with pathophysiological reasoning. Moreover, explainable AI may reduce interobserver variability by offering a reproducible, data-driven supplement to human interpretation—particularly valuable in borderline cases such as minimal vascular invasion or focal extrathyroidal extension, where histopathologic consensus is often limited.
In the context of personalized medicine, the integration of explainable ML models allows risk prediction to evolve from population-based guidelines toward individualized treatment trajectories. Rather than classifying patients as low, intermediate, or high risk in static categories, SHAP-informed models generate a continuous risk spectrum that can adapt as new clinical or biochemical information emerges. This facilitates truly personalized surveillance—where follow-up frequency, imaging modality, and therapeutic intensity are dynamically matched to each patient’s evolving risk profile. Furthermore, the interpretability of the SHAP framework offers a mechanism for model validation across diverse patient cohorts and institutions. By comparing SHAP feature rankings in external datasets, researchers can assess whether predictive drivers remain stable or vary across populations—an essential step for equitable implementation of AI tools in precision oncology.
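One simple way to quantify whether SHAP feature rankings remain stable across cohorts is a rank correlation between the two orderings. The sketch below uses Spearman's coefficient for rankings without ties; both rankings are hypothetical, since no external cohort was analyzed in this study.

```python
def spearman_from_rankings(ranking_a, ranking_b):
    """Spearman rank correlation between two orderings of the same features.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid when there are no ties.
    """
    if set(ranking_a) != set(ranking_b):
        raise ValueError("Both rankings must contain the same features.")
    n = len(ranking_a)
    position_b = {name: rank for rank, name in enumerate(ranking_b)}
    d_squared = sum((rank_a - position_b[name]) ** 2
                    for rank_a, name in enumerate(ranking_a))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical SHAP-based rankings (most to least important).
internal = ["N", "T", "Response", "Age", "M", "Hx Radiotherapy"]
external = ["T", "N", "Response", "Age", "Hx Radiotherapy", "M"]

rho = spearman_from_rankings(internal, external)
print(f"Spearman rho = {rho:.3f}")
```

A coefficient close to 1 would suggest that the same predictors drive the model in both cohorts, whereas a low value would flag population-specific drivers that deserve closer review before deployment.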
The proposed model underscores a paradigm shift from descriptive staging toward explainable, adaptive modeling that aligns with the principles of learning health systems. In clinical practice, such tools could be embedded within electronic medical records to provide real-time, interpretable risk updates as patient data accumulate. Future studies should aim to integrate molecular biomarkers (e.g., BRAF, RET mutations), radiomic features, and temporal biochemical trends to refine prediction granularity further. Ultimately, explainable AI in thyroid oncology holds promise not only for improving recurrence risk prediction but also for reinforcing clinical reasoning, promoting transparency, and advancing personalized, data-driven care for patients with differentiated thyroid cancer.
Although the proposed explainable machine-learning framework demonstrates strong predictive power and clear clinical interpretability, several limitations should be acknowledged. First, the dataset used in this study was derived from a single public source (UCI Differentiated Thyroid Cancer Recurrence dataset) with a relatively limited sample size. While the inclusion of key clinicopathologic features provides a robust foundation, certain parameters such as biochemical markers (e.g., thyroglobulin, anti-Tg antibodies), detailed histopathological subvariants, and molecular markers (e.g., BRAF, RET, RAS) were unavailable. These molecular determinants are increasingly recognized as major contributors to disease aggressiveness and may further enhance the precision of recurrence prediction if integrated into future models.
Second, although the “Risk” label follows ATA criteria, it is assigned from pathologist-assessed inputs rather than measured directly as an endpoint. This dependency introduces potential interobserver variability, particularly in the assessment of vascular invasion, extrathyroidal extension, and capsular infiltration. As a result, some degree of label uncertainty may propagate into the model, potentially affecting calibration. Incorporating probabilistic or fuzzy labeling frameworks may mitigate this issue by quantifying uncertainty at the input level.
Third, the present analysis employed a retrospective dataset without external validation. Although internal validation demonstrated high accuracy and AUC, prospective multicenter validation using real-world hospital registries is required to confirm generalizability across ethnicities, clinical settings, and diagnostic equipment. Cross-institutional collaborations and federated learning approaches could address this limitation while maintaining patient data privacy.
Future work should prioritize external validation using real-world hospital datasets to ensure model generalizability and strengthen clinical credibility. This study provides a foundation for developing an effective data acquisition strategy to systematically collect high-quality clinical features required for model deployment. Establishing robust external validation pipelines will support evidence-based integration of predictive models into routine practice and help guarantee that performance remains reliable across diverse patient populations and clinical settings.
Ultimately, explainable machine-learning tools such as SHAP could help shift endocrine oncology from a static, guideline-based discipline toward a dynamic learning system that improves continuously as more patient data become available. Such integration of artificial intelligence with clinical reasoning represents a critical step toward transparent, equitable, and patient-centered precision medicine in differentiated thyroid cancer.
Despite these limitations, this work makes several notable contributions:
(1) It is among the first studies to systematically compare predictive performance under both postoperative and purely preoperative scenarios in DTC risk stratification, demonstrating that accurate prediction can be achieved even without Response data.
(2) The integration of SHAP-based explainability offers interpretable clinical insights that align with standard risk frameworks (ATA/AJCC), supporting transparent AI-assisted decision-making.
(3) The study establishes a practical modeling pipeline using widely accessible software (MATLAB Classification Learner, version R2025a), enabling straightforward clinical translation and reproducibility in healthcare environments with limited computational resources.
(4) The findings help inform future data collection strategies for external validation and multimodal model enhancement, laying groundwork for deployment within real-world clinical workflows.
In this context, the emergence of precision oncology provides an important conceptual foundation for integrating individualized clinical, pathological, and treatment-related information into risk-adapted management strategies for DTC. Recent advances in artificial intelligence further support this paradigm by enabling data-driven, patient-specific prediction models that can complement traditional staging frameworks and enhance treatment precision. Accordingly, explainable machine-learning approaches—such as the framework proposed in this study—may serve as a bridge between statistical prediction and personalized clinical decision-making, reinforcing the shift toward more refined and patient-centered risk stratification.
5. Conclusions
This study demonstrates the clinical utility of two complementary explainable machine-learning approaches for recurrence risk stratification in differentiated thyroid cancer (DTC). The interpretable neural network model incorporating both static pathological indicators (T, N, M, Hx Radiotherapy) and dynamic postoperative variables (Response, Age) achieved the highest predictive performance, closely aligning with expert clinical reasoning. In parallel, the optimizable tree model—designed without the Response variable to reflect a purely preoperative prediction scenario—retained strong classification capability, confirming that reliable risk estimation can still be achieved prior to postoperative follow-up. Together, these models highlight flexible and clinically coherent strategies for precision risk assessment across the entire continuum of care.
Clinically, this approach bridges the gap between traditional risk classification systems (ATA, AJCC) and modern precision oncology. Rather than relying on categorical staging alone, SHAP-based interpretation enables patient-level visualization of how each clinical variable contributes to recurrence probability—facilitating informed, personalized management decisions. Such transparency supports multidisciplinary decision-making, promotes physician trust in AI systems, and allows continuous refinement of follow-up protocols according to evolving risk patterns.
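As a brief sketch of what this patient-level attribution means (the notation is illustrative and not taken from the study), SHAP decomposes the model output f(x) for a patient x described by M clinical features into a cohort-level baseline plus one contribution per feature:

\[
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x), \qquad \phi_0 = \mathbb{E}\left[f(X)\right]
\]

Here \phi_i(x) is the contribution of feature i (for example T, N, or Response) for that patient: a positive value pushes the prediction above the cohort average and a negative value pulls it below. Depending on the implementation, f may be a class score or log-odds rather than a probability, so the contributions should be read on whichever scale the explainer reports.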
From a broader perspective, this framework illustrates the potential of explainable artificial intelligence (XAI) to transform endocrine oncology into a learning health system, where every clinical encounter contributes to iterative model improvement. Future extensions that incorporate molecular, radiomic, and longitudinal biochemical data may further enhance individualized prediction and treatment precision. Ultimately, integrating interpretable ML tools into clinical workflows represents a decisive step toward data-driven, transparent, and personalized care in thyroid cancer management.