3.5.1. Model Performance Comparison
With the exception of KNN, all ML models outperformed previously published conventional logistic regression baselines (AUC ≈ 0.80 [18]). Under rigorous repeated cross-validation, two classifiers were statistically tied at the top of the ranking: L1-regularized logistic regression (LASSO; AUC = 0.941 ± 0.041) and random forest (0.933 ± 0.065). L2-regularized logistic regression (0.930 ± 0.049), XGBoost (0.927 ± 0.074), neural network (0.924 ± 0.062), and SVM with RBF kernel (0.923 ± 0.063) followed within a tight band, while gradient boosting (0.908 ± 0.069) and KNN (0.791 ± 0.090) trailed (
Table 3). Paired DeLong tests on the concatenated out-of-fold predictions confirmed that none of the 15 pairwise contrasts among the top six classifiers reached statistical significance (all p > 0.05); the LASSO–random forest contrast specifically yielded z = −0.18, p = 0.85. Random forest was significantly more discriminative than KNN (z = 2.74, p = 0.006) and borderline superior to gradient boosting (z = 1.97, p = 0.048). Given that the AUC differences among the top six classifiers (0.923–0.941) were smaller than their respective cross-validation standard deviations (0.041–0.074), these models were statistically indistinguishable in discriminative performance. Random forest was retained as the headline classifier in subsequent analyses because it provides the nonlinear interaction discovery and SHAP-based feature decomposition that support the manuscript’s interpretive findings (
Section 3.5.2 onward); the LASSO numerical equivalence is reported here to enable transparent comparison.
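As a reading aid, the reported z statistics can be mapped back to two-sided p-values under a standard-normal reference distribution. The sketch below is an illustration only, not the DeLong procedure itself: it assumes the z values quoted above are asymptotically standard normal, and the function name `two_sided_p` is ours.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic z."""
    # For Z ~ N(0, 1): P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2.0))


# z statistics quoted in the text for three paired DeLong contrasts
contrasts = {
    "LASSO vs random forest": -0.18,
    "random forest vs KNN": 2.74,
    "random forest vs gradient boosting": 1.97,
}

for name, z in contrasts.items():
    print(f"{name}: z = {z:+.2f}, two-sided p = {two_sided_p(z):.3f}")
```

Because the z statistics are reported to two decimals, the recomputed p-values should be expected to match the reported ones only approximately.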
Table 3.
Machine learning model performance (5-fold CV × 10 repeats).
| Model | AUC-ROC | F1 Score | Brier Score |
|---|---|---|---|
| L1-Logistic (LASSO) | 0.941 ± 0.041 | 0.836 ± 0.082 | 0.100 ± 0.038 |
| Random Forest | 0.933 ± 0.065 | 0.865 ± 0.081 | 0.111 ± 0.034 |
| L2-Logistic | 0.930 ± 0.049 | 0.847 ± 0.069 | 0.102 ± 0.036 |
| XGBoost | 0.927 ± 0.074 | 0.877 ± 0.085 | 0.098 ± 0.047 |
| Neural Network | 0.924 ± 0.062 | 0.824 ± 0.103 | 0.117 ± 0.063 |
| SVM (RBF) | 0.923 ± 0.063 | 0.853 ± 0.079 | 0.114 ± 0.035 |
| Gradient Boosting | 0.908 ± 0.069 | 0.816 ± 0.085 | 0.153 ± 0.073 |
| KNN (k = 7) | 0.791 ± 0.090 | 0.569 ± 0.155 | 0.188 ± 0.038 |
Training stability was verified through three complementary diagnostics, as summarized in
Figure 3. First, random forest out-of-bag (OOB) error declined sharply over the first ~50 trees and stabilized after approximately 200 trees at ~12.2% (
Figure 3A); no further improvement was observed beyond 200 trees, confirming that the chosen value of 500 trees was past the plateau and not over-fit to a particular ensemble size. Second, XGBoost training and validation log-loss curves over 300 boosting rounds tracked closely throughout, with both curves declining monotonically and the validation curve reaching its minimum at round 299 (essentially the end of the budget) at a log-loss of 0.272; the train/validation gap at this minimum was −0.017 log-loss units (
Figure 3B). The absence of a widening train–validation gap argues against over-fitting at the chosen regularization (max_depth = 3, learning_rate = 0.03, reg_lambda = 5, reg_alpha = 1, subsample = 0.85, colsample_bytree = 0.85; see
Section 2.4.2 for the rationale behind these stability-driven hyperparameters). Finally, a learning curve constructed by progressively increasing the training fraction (from 30% to 90% of the cohort, in 10% increments, each evaluated by 5-fold CV) showed mean AUC oscillating in a narrow band of 0.91–0.94 across all training fractions and asymptoting at approximately 0.929 (
Figure 3C), with confidence-interval width contracting as the training set grew—indicating that the marginal benefit of additional training data within this cohort has begun to saturate but that additional data would still tighten the estimates. The AUC distribution across the 50 cross-validation folds (
Figure 3D) was concentrated above 0.85 with a left tail extending to 0.73 (random forest: mean 0.933, median 0.950, IQR [0.887, 0.988], min 0.725, max 1.000), without bimodality or extreme outliers, supporting the stability of the reported mean estimate.
To contextualize the multi-feature ML models, a univariate logistic regression using ALP alone was evaluated under the identical 5-fold CV × 10 repeat protocol, yielding a cross-validated AUC of 0.958 ± 0.041. This univariate performance exceeded that of all eight multi-feature models, suggesting that the additional 23 features provided no net discriminative benefit beyond ALP in this cohort, a finding consistent with ALP’s overwhelming SHAP dominance (
Section 3.5.2). While this does not negate the clinical value of multi-feature models for risk calibration and subgroup analysis, it highlights that the observed ML performance is driven almost entirely by a single biomarker rather than complex feature interactions, and that the multi-feature models may be subject to noise accumulation given the low events-per-variable ratio (EPV ≈ 1.7).
ROC curves for all eight algorithms are shown in
Figure 4A, with the corresponding AUC-ROC rankings and standard deviations displayed in
Figure 4B. Precision–recall analysis confirmed robust performance across models, with the top four classifiers all achieving average precision above 0.90 (
Figure 4C). Calibration assessment revealed that the L1-logistic and random forest models tracked the diagonal most closely, indicating well-calibrated predicted probabilities relative to the observed HBS rates (
Figure 4D).
At the optimal Youden threshold (0.647), the random forest achieved 85.4% sensitivity, 95.9% specificity, 94.6% positive predictive value, and 88.7% negative predictive value, with an overall accuracy of 91.1% (see below—
Section 3.5.4). The near-equivalence of the top four models—spanning linear and nonlinear algorithms—suggests that ALP’s dominant signal is sufficiently strong to be captured by simpler classifiers, while tree-based and kernel methods offer only marginal additional performance through modeling interactions and nonlinearities.
The four pre-specified criteria were applied to the eight benchmarked classifiers and to the univariate-ALP baseline (
Table 4). On AUC, the LASSO L1-logistic regression ranked first (0.941), the random forest second (0.933), and the L2-logistic regression and XGBoost third and fourth (0.930 and 0.927, respectively). On Brier score, the order changed: XGBoost ranked first (0.098), the L1-logistic regression second (0.100), the L2-logistic regression third (0.102), and the random forest fourth (0.111). On F1 at the Youden-optimal threshold, XGBoost again ranked first (0.877), with the random forest second (0.865) and the L1-logistic regression fourth (0.836). On decision-curve net benefit at threshold 0.50, the univariate-ALP baseline (0.367) outperformed every multi-feature model, including the random forest (0.344).
The criteria therefore did diverge, in two notable directions. First, no single classifier dominated all four criteria simultaneously. XGBoost was the best-calibrated model and the best on F1, but ranked fourth on AUC; the LASSO L1-logistic regression was the best on AUC, but its calibration was indistinguishable from the L2-logistic and worse than XGBoost; the random forest was second on both AUC and F1 but ranked fourth on Brier score. Second, on the clinical-utility criterion (net benefit at pt = 0.50), the univariate-ALP baseline outperformed every multi-feature model, consistent with the decision-curve and NRI findings reported in
Section 3.5.7.
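The divergence between criteria can be reproduced from the Table 3 means. The following is a minimal sketch restricted to the four classifiers named in the preceding paragraphs; it assumes the Table 3 means are the values used for ranking (Table 4 is not reproduced here), and the helper name `order_by` is ours.

```python
# (AUC, F1, Brier) means copied from Table 3 for the four classifiers discussed
table3 = {
    "L1-Logistic (LASSO)": (0.941, 0.836, 0.100),
    "Random Forest": (0.933, 0.865, 0.111),
    "L2-Logistic": (0.930, 0.847, 0.102),
    "XGBoost": (0.927, 0.877, 0.098),
}


def order_by(metric_index: int, higher_is_better: bool) -> list:
    """Return model names sorted from best to worst on one metric."""
    return sorted(
        table3,
        key=lambda model: table3[model][metric_index],
        reverse=higher_is_better,
    )


auc_order = order_by(0, higher_is_better=True)
f1_order = order_by(1, higher_is_better=True)
brier_order = order_by(2, higher_is_better=False)  # lower Brier score is better

for label, order in [("AUC", auc_order), ("F1", f1_order), ("Brier", brier_order)]:
    print(f"{label:5s}: " + " > ".join(order))
```

No single model heads all three lists, which is the sense in which the criteria diverge.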
We retained the random forest as the headline multi-feature classifier on the basis of the two tiebreaker considerations specified in
Section 2.4.6: (a) on the training-stability diagnostics in
Figure 3, the random forest exhibited the smallest train-validation gap (the OOB error stabilized cleanly after 200 trees with no signs of over-fit, and the learning curve in
Figure 3C asymptoted within a narrow 0.91–0.94 band), whereas the XGBoost diagnostic curves in
Figure 3B required tight regularization to suppress the train-validation gap at this events-per-variable ratio; (b) the random forest’s recursive-partitioning architecture supports the SHAP interaction decomposition reported below (in
Section 3.5.2 and
Figure 5 and
Figure 6) in a manner that is consistent with the manuscript’s interpretive findings. We acknowledge that on strict calibration grounds, XGBoost would have been the preferred classifier, and that on strict discrimination grounds, the LASSO L1-logistic regression would have been preferred. This caveat does not affect the principal claims of the manuscript: none of the eight multi-feature classifiers outperformed univariate ALP on either AUC or decision-curve net benefit at pt = 0.50, and the random forest—chosen for direct head-to-head NRI testing as the headline multi-feature model—did not outperform univariate ALP on net reclassification improvement (
Section 3.5.7).
3.5.4. Clinical Decision Support and Risk Stratification
At the optimal Youden threshold (0.647), the random forest confusion matrix demonstrated 47 true negatives, 2 false positives, 6 false negatives, and 35 true positives (
Figure 7A). Threshold optimization curves (
Figure 7B) confirmed that this cut-point maximized the balance between sensitivity (85.4%) and specificity (95.9%), with the F1 score peaking at 0.87. Cross-validated random forest predicted probabilities stratified patients into four risk tiers (
Figure 7C): low (<25%,
n = 34, observed HBS 8.8%), moderate (25–50%,
n = 16, 18.8%), high (50–75%,
n = 24, 79.2%), and very high (>75%,
n = 16, 100%). Observed HBS rates rose monotonically across the four tiers, indicating that the ML-predicted probabilities track actual outcomes and could inform preoperative calcium management intensity.
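The operating characteristics quoted for the Youden threshold follow directly from the confusion-matrix counts above. A minimal sketch of that arithmetic (the function name `operating_characteristics` is ours):

```python
def operating_characteristics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Standard binary-classification summaries from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "youden_j": sensitivity + specificity - 1.0,
    }


# Counts reported for the random forest at the 0.647 threshold
metrics = operating_characteristics(tn=47, fp=2, fn=6, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```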
To clinically contextualize model errors, we performed a case-by-case adjudication of all misclassified patients at the Youden-optimal threshold of 0.647 (random forest, out-of-fold predictions), comprising 2 false positives and 6 false negatives (
Appendix A Table A2). Both false-positive cases had preoperative ALP near the upper boundary of the inflection zone (284 and 292 U/L) and underwent SPTX; their postoperative calcium nadirs were 8.2 and 8.6 mg/dL—only marginally above the 8.0 mg/dL HBS threshold in the first case and clearly above it in the second. These cases sit on the inflection slope where ALP-driven HBS risk transitions sharply, and the model’s elevated probability assignment (0.701 and 0.667) is clinically defensible: it would, if anything, have prompted appropriately intensive prophylactic calcium supplementation. Among the 6 false negatives, four had preoperative ALP in the lower-intermediate range (155–256 U/L) but developed HBS—likely reflecting determinants not captured in the preoperative feature set (e.g., concurrent vitamin-D status, magnesium balance, or intraoperative variables such as resected gland weight, none of which were available in the dataset). The remaining two false negatives had an ALP near the inflection zone (274 and 282 U/L) and predicted probabilities of 0.244 and 0.407—substantially below the 0.647 threshold despite progression to HBS.
Threshold selection therefore involves an explicit trade-off between sensitivity (avoiding under-treatment of patients who progress to HBS) and specificity (avoiding unnecessary intensive prophylaxis in patients who would not). Operating characteristics across the clinically relevant threshold range are tabulated in
Appendix A Table A5 and summarized in two illustrative scenarios. Under a high-sensitivity scenario (threshold 0.30), the random forest classifier achieved 87.8% sensitivity and 75.5% specificity, with 12 false-positive predictions among the 49 HBS-negative patients; this corresponds to over-treating approximately one in four non-HBS patients in exchange for capturing 36 out of 41 true HBS events. Under a high-specificity scenario (threshold 0.60), the same model achieved 85.4% sensitivity and 95.9% specificity, with only two false positives but the same six false negatives. The Youden-optimal cut-point of 0.647 reported in
Section 3.5.4 represents the point at which specificity reached 95.9% in our cohort while sensitivity was held at 85.4%, an operating point that prioritizes specificity. Notably, lowering the threshold below 0.30 yielded only a modest gain in sensitivity (95.1% at 0.20–0.25 vs. 87.8% at 0.30) but caused the specificity to fall sharply (51.0% at 0.20), so very-low-threshold operating points provide a poor risk–benefit trade-off in this cohort. This case-level review supports the conclusion that residual model errors are driven by unmeasured biological factors rather than by model instability.
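The trade-off can be made concrete with Youden's J (sensitivity + specificity − 1). The sketch below uses only the three operating points quoted in this section at thresholds 0.20, 0.30, and 0.60 (Appendix A Table A5 is not reproduced here) and converts the rates back to error counts using the cohort sizes of 41 HBS-positive and 49 HBS-negative patients. The function names are ours.

```python
N_POS, N_NEG = 41, 49  # HBS-positive and HBS-negative patients in the cohort

# (threshold, sensitivity, specificity) as quoted in the text
points = [
    (0.20, 0.951, 0.510),
    (0.30, 0.878, 0.755),
    (0.60, 0.854, 0.959),
]


def youden_j(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic for one operating point."""
    return sensitivity + specificity - 1.0


def error_counts(sensitivity: float, specificity: float) -> tuple:
    """Approximate (false negatives, false positives) implied by the rates."""
    false_negatives = round((1.0 - sensitivity) * N_POS)
    false_positives = round((1.0 - specificity) * N_NEG)
    return false_negatives, false_positives


for threshold, sens, spec in points:
    fn, fp = error_counts(sens, spec)
    print(f"pt = {threshold:.2f}: J = {youden_j(sens, spec):.3f}, FN = {fn}, FP = {fp}")

# Operating point with the largest J among the three listed
best_threshold = max(points, key=lambda p: youden_j(p[1], p[2]))[0]
```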
SHAP and permutation importance rankings converged in confirming ALP’s dominance, with SHAP placing it first by a factor of 6.5 and permutation importance by a factor of approximately 16 over the respective next-ranked features (
Figure 7D). An ALP threshold sweep analysis (
Figure 7E) further illustrates that the conventional 300 U/L cut-point achieved 100% specificity for HBS detection, whereas sensitivity continued to improve at lower thresholds, suggesting that the 250–300 U/L zone warrants individualized clinical attention.
Guided by SHAP-derived feature importance, a pragmatic composite bedside risk score (range 0–9) was constructed from five clinically accessible variables: ALP category (0–3 points: ≤200 = 0, 201–300 = 1, 301–400 = 2, >400 = 3), severe bone pain (2 points), TPTX (1 point), PTH level (1 point if >2000 pg/mL, additional 1 if >3000), and elevated creatinine > 10 mg/dL (1 point). This composite score achieved an AUC of 0.883, with monotonically increasing observed HBS rates from 0% at score 0 to 100% at scores ≥ 6 (
Figure 7F). As this score was both derived and evaluated on the same cohort—with point weights selected post hoc from the SHAP rankings on this dataset—these performance estimates, including the apparent calibration, are likely optimistic; independent prospective validation is an essential prerequisite before clinical adoption.
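The point assignment described above can be written as a short function. This is a sketch of the stated rules only: the function and argument names are ours, and treating non-integer ALP values between category edges (for example 200.5 U/L) as belonging to the higher category is an assumption, since the text lists integer ranges.

```python
def bedside_score(
    alp_u_per_l: float,
    severe_bone_pain: bool,
    tptx: bool,
    pth_pg_per_ml: float,
    creatinine_mg_per_dl: float,
) -> int:
    """Composite bedside risk score (0-9) following the point rules in the text."""
    # ALP category: <=200 -> 0, 201-300 -> 1, 301-400 -> 2, >400 -> 3
    if alp_u_per_l <= 200:
        points = 0
    elif alp_u_per_l <= 300:
        points = 1
    elif alp_u_per_l <= 400:
        points = 2
    else:
        points = 3

    if severe_bone_pain:
        points += 2
    if tptx:
        points += 1
    if pth_pg_per_ml > 2000:
        points += 1
    if pth_pg_per_ml > 3000:  # additional point above 3000 pg/mL
        points += 1
    if creatinine_mg_per_dl > 10:
        points += 1
    return points


# Hypothetical patient, for illustration only
print(bedside_score(250, severe_bone_pain=False, tptx=True,
                    pth_pg_per_ml=2500, creatinine_mg_per_dl=9.0))
```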
3.5.5. Novel Insights and Subgroup Generalization
The nonlinear ALP–HBS relationship was further characterized with 95% Wilson confidence intervals (
Figure 8A), confirming the steep transition from near-zero risk below 200 U/L to near-certain risk above 350 U/L. SHAP dependence analysis stratified by surgical approach revealed that TPTX amplifies ALP-mediated HBS risk, with a steeper SHAP gradient at lower ALP values in TPTX patients compared with SPTX (
Figure 8B), consistent with more abrupt PTH withdrawal producing greater skeletal calcium hunger.
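For readers unfamiliar with the Wilson interval, the sketch below shows the standard formula applied to two proportions reported elsewhere in this section (16 of 16 in the very-high risk tier; 1 of 26 in the ALP ≤ 200 U/L stratum). The per-bin counts behind Figure 8A are not given in the text, so these inputs are illustrative only, and the function name is ours.

```python
import math


def wilson_interval(events: int, n: int, z: float = 1.959964) -> tuple:
    """Wilson score interval for a binomial proportion (default: 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = events / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    )
    # Clip to [0, 1] to guard against floating-point overshoot
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


for events, n in [(16, 16), (1, 26)]:
    low, high = wilson_interval(events, n)
    print(f"{events}/{n}: 95% Wilson CI = [{low:.3f}, {high:.3f}]")
```

Unlike the simple Wald interval, the Wilson interval does not collapse to zero width when the observed rate is 0% or 100%, which matters for small bins at the extremes of the ALP range.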
The granular 4 × 4 PTH × ALP risk heatmap (
Figure 8C) confirmed that patients with ALP > 300 U/L had 100% HBS rates across all PTH categories, while those with ALP < 150 U/L had 0% regardless of PTH level. Cross-validated subgroup analysis was performed across the clinically relevant strata shown in
Figure 8D. In adequately sized subgroups, the random forest maintained AUC > 0.83: TPTX-only (AUC = 0.981;
n = 41, 23 HBS events), SPTX-only (0.837;
n = 49, 18 events), PTH > 2000 pg/mL (0.958;
n = 38, 18 events), and younger (≤55 years, 0.962;
n = 43, 21 events) vs. older patients (>55, 0.865;
n = 47, 20 events). These subgroup AUCs should be interpreted as descriptive rather than confirmatory, since cohorts of <50 patients are subject to substantial fold-to-fold variance under 5-fold cross-validation. The apparent sub-chance AUC of 0.240 in the ALP ≤ 200 U/L stratum (
Figure 8D) is an artifact of extreme class imbalance: this subgroup contains only 1 HBS event among 26 patients, so the AUC reduces to the relative rank of a single positive case against 25 negatives and is statistically ill-defined; it should not be interpreted as evidence of model failure in this stratum, where the clinically relevant observation is that HBS incidence is uniformly very low (1/26, 3.8%). External validation in larger subgroup-specific cohorts is required before any subgroup-level conclusions can be drawn.
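The single-positive case can be illustrated with the rank (Mann–Whitney) form of the AUC. The scores below are hypothetical and chosen only so that one positive outranks 6 of 25 negatives; in the absence of ties this gives 6/25 = 0.24, the value reported for this stratum. The function name is ours.

```python
def auc_mann_whitney(pos_scores: list, neg_scores: list) -> float:
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted probabilities: 25 negatives at 0.01, 0.02, ..., 0.25
negatives = [i / 100 for i in range(1, 26)]

# One positive scored above exactly 6 of the negatives
auc_low = auc_mann_whitney([0.065], negatives)

# Moving that single case up one rank shifts the AUC by 1/25 = 0.04
auc_shifted = auc_mann_whitney([0.075], negatives)

print(f"AUC = {auc_low:.3f}; after a one-rank shift = {auc_shifted:.3f}")
```

With a single event, the estimate can only take values in steps of 0.04, so it carries very little information about discrimination in this stratum.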
3.5.7. Clinical-Utility Comparison: Decision-Curve Analysis and Net Reclassification Improvement
Net benefit across the threshold range 0.20–0.70 is reported in
Table 5 and plotted in
Figure 9. The univariate-ALP and random forest curves crossed at a threshold of approximately 0.55: at lower thresholds (the high-sensitivity regime), univariate ALP yielded greater net benefit than the multi-feature random forest, whereas at higher thresholds (the high-specificity regime), the random forest yielded the greater net benefit. At the clinically plausible threshold of 0.30 (high-sensitivity scenario from
Section 3.5.4), net benefit was 0.376 for univariate ALP, 0.343 for the random forest, 0.321 for the composite bedside score, and 0.222 for treat-all (treat-none was 0 by construction at this threshold). At a threshold of 0.50 (balanced scenario), net benefit was 0.367, 0.344, 0.289, and −0.089, respectively; the negative value for treat-all reflects the fact that at this threshold, indiscriminately treating every patient produces more harm (in the form of unnecessary supplementation in non-HBS patients) than benefit. At a threshold of 0.60 (high-specificity scenario), the ranking inverted: net benefit was 0.356 for the random forest, 0.278 for univariate ALP, and 0.261 for the composite score. At pt = 0.70, the random forest retained a similar advantage (0.344 vs. 0.289 for ALP). The composite bedside score tracked between the random forest and treat-all curves, retaining clinically meaningful net benefit but underperforming both univariate ALP and the random forest at every threshold examined. These findings indicate that the choice between univariate ALP and the multi-feature random forest depends on the operating point: simple stratification on preoperative ALP is preferable when the goal is high sensitivity (avoiding missed HBS cases), whereas the multi-feature random forest is preferable when the goal is high specificity (avoiding unnecessary intensive prophylaxis).
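The net-benefit values can be checked against the usual decision-curve definition, NB = TP/n − (FP/n) × pt/(1 − pt). The sketch below assumes that definition and uses the cohort size (n = 90, 41 HBS events) together with the random forest true-positive and false-positive counts reported in Section 3.5.4 for thresholds 0.30 and 0.60. The function names are ours.

```python
def net_benefit(tp: int, fp: int, n: int, pt: float) -> float:
    """Decision-curve net benefit at threshold probability pt."""
    return tp / n - (fp / n) * (pt / (1.0 - pt))


def net_benefit_treat_all(n_events: int, n: int, pt: float) -> float:
    """Treat-all strategy: every event is a TP, every non-event an FP."""
    return net_benefit(n_events, n - n_events, n, pt)


N, EVENTS = 90, 41

print(f"treat-all, pt = 0.30: {net_benefit_treat_all(EVENTS, N, 0.30):.3f}")
print(f"treat-all, pt = 0.50: {net_benefit_treat_all(EVENTS, N, 0.50):.3f}")

# Random forest counts from Section 3.5.4: 36 TP / 12 FP at 0.30; 35 TP / 2 FP at 0.60
print(f"random forest, pt = 0.30: {net_benefit(36, 12, N, 0.30):.3f}")
print(f"random forest, pt = 0.60: {net_benefit(35, 2, N, 0.60):.3f}")
```

The weight pt/(1 − pt) grows with the threshold, which is why false positives are penalized more heavily in the high-specificity regime.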
Using the four pre-specified risk tiers (low, moderate, high, very high), the categorical NRI of the multi-feature random forest vs. the univariate-ALP baseline was −28.1% (95% CI not estimated due to small-sample non-parametric variance), with an event-NRI of −22.0% and a non-event-NRI of −6.1%. Of the 41 HBS+ patients, the random forest reclassified 3 (7.3%) into a higher-risk tier and 12 (29.3%) into a lower-risk tier compared with univariate ALP; among the 49 HBS− patients, 8 (16.3%) were reclassified upward and 5 (10.2%) downward. The continuous NRI was −67.2% (events: 31.7% up, 68.3% down; non-events: 65.3% up, 34.7% down), which corroborates the categorical finding that the multi-feature pipeline reclassifies a substantial fraction of true HBS+ patients into lower probability bins relative to univariate ALP. In other words, the multi-feature model is not merely non-superior to univariate ALP—by the strict criterion of NRI, it is mildly inferior at moving cases in the clinically helpful direction.
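The NRI components above follow from the reclassification counts. A minimal sketch of the arithmetic, using the standard definition (events moved up minus down, plus non-events moved down minus up); the function name is ours, and the continuous NRI is computed from the quoted proportions rather than from counts.

```python
def nri(events_up, events_down, n_events,
        nonevents_up, nonevents_down, n_nonevents) -> tuple:
    """Return (event NRI, non-event NRI, total NRI) as proportions."""
    event_nri = (events_up - events_down) / n_events
    # For non-events, moving DOWN a tier is the helpful direction
    nonevent_nri = (nonevents_down - nonevents_up) / n_nonevents
    return event_nri, nonevent_nri, event_nri + nonevent_nri


# Categorical NRI: random forest vs. univariate ALP across the four risk tiers
cat_event, cat_nonevent, cat_total = nri(3, 12, 41, 8, 5, 49)
print(f"categorical: events {cat_event:+.1%}, "
      f"non-events {cat_nonevent:+.1%}, total {cat_total:+.1%}")

# Continuous NRI from the quoted proportions (denominators set to 1)
con_event, con_nonevent, con_total = nri(0.317, 0.683, 1, 0.653, 0.347, 1)
print(f"continuous:  events {con_event:+.1%}, "
      f"non-events {con_nonevent:+.1%}, total {con_total:+.1%}")
```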
The DCA and NRI results converge with the AUC comparison reported in
Section 3.5.1 on a more nuanced conclusion than “no benefit”: the additional 23 features beyond ALP do not improve clinical utility in the high-sensitivity regime that is most relevant to HBS prophylaxis (where the principal harm is a missed case), but they do confer a modest net-benefit advantage at high-specificity operating points (pt ≥ 0.60). On NRI—a metric that does not depend on a chosen operating threshold—the multi-feature pipeline is mildly inferior to univariate ALP. This pattern is internally consistent with the SHAP analysis (
Section 3.5.2), in which ALP exceeded the next-ranked feature by a factor of 6.5 in mean absolute SHAP magnitude: the multi-feature pipeline can re-rank patients in the upper probability range (where ALP alone is already saturating near 100% predicted risk), but it cannot extract additional discriminative signal from features that contribute little to the outcome. We retained the multi-feature pipeline in the final manuscript not because it strictly dominates univariate ALP, but because (i) the SHAP and partial-dependence outputs characterize the nonlinear ALP–HBS dose–response and the TPTX × ALP interaction in ways that single-variable logistic regression cannot, and (ii) the SHAP-guided composite bedside score (range 0–9), although discriminatively inferior to univariate ALP, is more easily computable at the bedside than a logistic regression coefficient applied to a continuous biomarker, and may therefore be preferred in clinical workflows that require a discrete integer score. The multi-feature framework is therefore best framed as an interpretive and pedagogical adjunct to univariate ALP, and as a tool for the high-specificity operating regime, rather than as a strict replacement.