1. Introduction
Urolithiasis remains one of the most common reasons for emergency department (ED) visits involving acute flank pain, and its prevalence continues to rise worldwide [1,2]. While non-contrast computed tomography (CT) is the reference standard for diagnosis, concerns about radiation exposure, cost, and scanner availability have motivated the development of clinical prediction rules that might identify patients at sufficiently high or low probability of harboring a stone to justify modifying the imaging strategy [3,4,5].
Several such scoring systems have been proposed, typically assigning points to a small number of binary variables derived from history, physical examination, urinalysis, and blood tests [6,7]. These instruments have demonstrated reasonable discriminative performance across different ED populations and offer the practical advantage of bedside applicability without computational tools. They convert continuous or ordinal measurements into binary indicators using predefined cutoffs, a step that simplifies bedside use but inevitably discards information.
The consequences of dichotomization in clinical prediction have been discussed extensively in the statistical literature [8]. A creatinine value of 0.91 mg/dL and one of 0.93 mg/dL are assigned to different categories when separated by a threshold of 0.92, even though their true association with stone probability may differ only marginally. Similarly, pain duration is often treated as a simple binary (<8 h vs. ≥8 h), despite the possibility that the relationship between symptom duration and stone probability is non-monotone. Such information loss is the price paid for simplicity, and whether the trade-off is worthwhile depends on how much predictive value is actually being sacrificed.
Machine learning (ML) models offer an alternative approach that can accommodate continuous variables and capture non-linear or interactive effects without requiring a priori cutoff selection [9,10]. However, the clinical utility of ML in this context is often questioned on two grounds. First, the discriminative improvement over well-designed scoring systems tends to be modest in moderate-sized single-center datasets. Second, the “black box” nature of many ML algorithms undermines the clinical transparency that scoring systems inherently provide [11,12]. Recent work has increasingly addressed this tension by pairing tree-based classifiers with Shapley-based explanation: van Doorn et al., for instance, developed a multicenter explainable ML model for rapid risk stratification of ED patients using routine laboratory data and SHAP-based interpretation [13], and Moreno-Sánchez et al. applied XGBoost with SHAP to ED patient flow prediction on the MIMIC-IV dataset [14].
Explainable artificial intelligence (XAI) techniques, particularly those grounded in Shapley value theory, have been increasingly adopted to address the interpretability concern [15,16]. By decomposing a model’s prediction into feature-level contributions for each individual patient, these approaches can answer clinically meaningful questions: not just “what is this patient’s probability of having a stone?” but also “which specific findings are driving that estimate, and in which direction?” The broader role of AI as a diagnostic support tool in clinically ambiguous presentations has been reviewed in other domains, including thyroid cytopathology [17], and the urolithiasis-specific ML literature has expanded substantially in recent years, with work by Nedbal et al. on ureteroscopic outcomes [18] and Kim et al. on deep-learning-based stone detection in ED CT images [19]. These efforts have focused predominantly on imaging-based or interventional outcomes rather than on the pre-imaging, bedside diagnostic-reasoning problem that motivates our study. To our knowledge, Shapley-based patient-level explanation has not been applied to pre-imaging urolithiasis diagnosis in the ED setting.
Beyond explainability, a question that existing prediction rules leave largely unanswered is how much additional diagnostic information each successive test contributes for a given patient. In routine ED practice, the diagnostic workup follows a roughly sequential pattern: history, physical examination, dipstick urinalysis, microscopy, and blood tests. Yet few studies have quantified the marginal information gain at each step or identified the point at which further testing no longer meaningfully reduces diagnostic uncertainty.
In this study, we pursued three related aims. First, we compared the discriminative performance of a gradient boosting classifier trained on continuous features against binary-thresholded counterparts and an established scoring system. Second, we used an interventional Shapley value approach to characterize how individual features contribute to predictions, with particular attention to non-linear effects that diverge from conventional dichotomization thresholds. Third, we developed a Shannon entropy-based sequential framework to quantify the marginal diagnostic value of each testing stage and to identify patients whose diagnostic uncertainty remains irreducible despite all available bedside information. Our intent is not to propose a replacement for existing clinical tools but rather to illustrate how explainable ML might complement them by offering a more granular, patient-level understanding of the diagnostic process.
2. Materials and Methods
2.1. Study Design and Population
This was a retrospective, single-center study conducted at the ED of Seoul St. Mary’s Hospital, a tertiary referral center affiliated with The Catholic University of Korea. We reviewed the records of consecutive adult patients (≥18 years) who presented with acute flank or abdominal pain suggestive of urolithiasis and underwent non-contrast CT between January 2014 and December 2015. A total of 1043 patient records were identified. After applying exclusion criteria (age < 18 years, n = 1; missing urinalysis or blood chemistry data, n = 33), 1009 patients remained eligible. Of these, 9 patients had missing values for the pain scale variable and were excluded, yielding a final cohort of 1000 patients (Figure 1). Complete-case analysis was used because the proportion of missing data was small (0.9%) and the missing variable was a primary predictor rather than a covariate amenable to imputation without introducing systematic bias. The reference standard for urolithiasis was the presence of a ureteral or renal calculus on non-contrast CT as interpreted by board-certified radiologists. Patients without a directly visualized stone but with ipsilateral acute obstructive signs on CT (including hydronephrosis, perinephric stranding, or ureteral dilatation) were also classified as having urolithiasis, as these findings are recognized indicators of urinary tract obstruction and are clinically indistinguishable from cases in which the stone has passed prior to imaging. The study was approved by the Institutional Review Board (IRB) of Seoul St. Mary’s Hospital (KC26RASI0131) and was conducted in accordance with the Declaration of Helsinki. Informed consent was waived by the IRB due to the retrospective nature of the study.
2.2. Predictor Variables
Seventeen clinical and laboratory variables were selected based on their availability in routine ED evaluation and their inclusion in prior scoring systems. These comprised five history and demographic features (age, sex, prior stone history, pain duration in hours, pain scale), four physical examination findings (nausea, vomiting, costovertebral angle tenderness, body temperature), three dipstick urinalysis results (leukocyte esterase, specific gravity, occult blood), three microscopy findings (red blood cell count, white blood cell count, crystalluria), and two blood test results (serum creatinine, C-reactive protein). All continuous variables were retained in their original scale without dichotomization.
For comparison, two additional reference models were constructed using the nine clinical variables of an established scoring system [7]: one applying the predefined dichotomization cutoffs, mirroring the structure of that score (9-feature binary model), and one retaining the original continuous values (9-feature continuous model). Comparing the two 9-feature models isolates the contribution of feature continuity, while comparing the 9-feature continuous model with the 17-feature model isolates that of feature selection.
2.3. Model Development
We trained a gradient boosting classifier (scikit-learn GradientBoostingClassifier, version 1.3) on the 17 features. Gradient boosting constructs an ensemble of shallow decision trees in a stage-wise additive fashion, with each successive tree fitted to the pseudo-residuals of the current ensemble. Unlike single decision trees, this approach can approximate complex non-linear and interaction effects while remaining more interpretable than deep neural networks. Hyperparameters were selected to balance model expressiveness against overfitting risk in a moderately sized, single-center dataset. Specifically, we used 200 boosting iterations with a conservative learning rate of 0.05 to allow gradual, stable convergence; a maximum tree depth of 3 to limit individual tree complexity; a minimum leaf size of 30 samples to prevent fitting to sparse subgroups; and stochastic subsampling of 80% of training instances and square-root feature selection per split to introduce regularizing variance across trees. No explicit grid search was performed; instead, hyperparameters were fixed a priori based on general gradient boosting literature, prioritizing stability over marginal performance gains.
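The stated configuration can be written down as a minimal sketch. The mapping of the prose onto scikit-learn argument names is our reading of the text, the `random_state` value is an assumption (the seed is not reported), and the data are a synthetic stand-in rather than the study cohort.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def make_gb_classifier(random_state=0):
    """Gradient boosting classifier using the hyperparameters listed in the text."""
    return GradientBoostingClassifier(
        n_estimators=200,           # 200 boosting iterations
        learning_rate=0.05,         # conservative learning rate
        max_depth=3,                # shallow individual trees
        min_samples_leaf=30,        # minimum leaf size of 30 samples
        subsample=0.8,              # stochastic subsampling of 80% of training rows
        max_features="sqrt",        # square-root feature selection per split
        random_state=random_state,  # assumption: the seed is not stated in the text
    )


# Synthetic stand-in for a 1000-patient, 17-feature cohort (NOT the study data).
X, y = make_classification(
    n_samples=1000, n_features=17, n_informative=6,
    weights=[0.15, 0.85], random_state=0,
)
model = make_gb_classifier().fit(X, y)
proba = model.predict_proba(X)[:, 1]  # predicted probability of the positive class
```

Fixing the configuration in a small factory function makes it straightforward to reuse the same settings for every reference and stage-specific model.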
Model performance was evaluated using five-fold stratified cross-validation. Predicted probabilities from the held-out folds were aggregated to compute the area under the receiver operating characteristic curve (AUC-ROC) with 95% confidence intervals derived from 1000 bootstrap samples [20,21]. Calibration was assessed visually using quantile-based calibration plots and numerically using the Brier score [20]. The two 9-feature reference models described in Section 2.2 were trained under identical cross-validation settings, enabling direct comparison of the effect of feature continuity and feature selection on discriminative performance.
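A minimal sketch of this evaluation loop is given below, assuming a percentile bootstrap over the pooled out-of-fold predictions (the text does not specify the bootstrap variant). The helper name `bootstrap_auc_ci`, the seeds, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def bootstrap_auc_ci(y_true, y_prob, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)   # resample patients with replacement
        if np.unique(y_true[idx]).size < 2:
            continue                       # AUC is undefined without both classes
        aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))
    low, high = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


# Synthetic stand-in data (NOT the study cohort).
X, y = make_classification(n_samples=600, n_features=17, n_informative=6,
                           weights=[0.15, 0.85], random_state=0)
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3,
                                 min_samples_leaf=30, subsample=0.8,
                                 max_features="sqrt", random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each patient's probability comes from the model that did not see that patient.
oof = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
auc = roc_auc_score(y, oof)
ci_low, ci_high = bootstrap_auc_ci(y, oof)
brier = brier_score_loss(y, oof)
```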
Permutation importance was computed on the held-out test fold of each of the five cross-validation splits, using 30 permutation repeats per fold and a fixed random seed, and reported as the mean and standard deviation across folds. Computing the metric on held-out rather than in-sample data ensures that the reported importance reflects each feature’s contribution to out-of-sample discrimination rather than in-sample fit.
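The per-fold procedure might look like the following sketch. The scoring metric (`roc_auc`) is an assumption, since the text refers only to out-of-sample discrimination, and the model size and synthetic data are reduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (NOT the study cohort).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_importances = []
for train_idx, test_idx in cv.split(X, y):
    clf = GradientBoostingClassifier(n_estimators=50, random_state=0)  # reduced for speed
    clf.fit(X[train_idx], y[train_idx])
    # Permute each feature on the held-out fold only, 30 repeats, fixed seed.
    result = permutation_importance(
        clf, X[test_idx], y[test_idx],
        scoring="roc_auc",  # assumption: metric not stated in the text
        n_repeats=30, random_state=0,
    )
    fold_importances.append(result.importances_mean)

fold_importances = np.vstack(fold_importances)        # shape: (folds, features)
mean_importance = fold_importances.mean(axis=0)       # mean across folds
sd_importance = fold_importances.std(axis=0, ddof=1)  # SD across folds
```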
As a transparent baseline for comparison, we additionally trained a logistic regression classifier on the same 17 features under identical five-fold stratified cross-validation, with all continuous features standardized prior to fitting (zero mean, unit variance). This allowed us to quantify the incremental discriminative contribution of the gradient boosting model relative to a simple linear model fitted to the same inputs. Differences in out-of-fold AUC between the two models were assessed using a paired bootstrap procedure (2000 resamples) to obtain a confidence interval and p-value for the AUC difference.
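One way to sketch the linear baseline and the paired comparison is shown below. Standardizing inside a pipeline (so that scaling is refit within each training fold) and the particular two-sided p-value definition are assumptions on our part, as is the helper name `paired_bootstrap_auc_diff`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def paired_bootstrap_auc_diff(y_true, prob_a, prob_b, n_boot=2000, seed=0):
    """Paired bootstrap for AUC(a) - AUC(b); both models share each resample."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)
        if np.unique(y_true[idx]).size < 2:
            continue  # AUC is undefined without both classes
        diffs.append(roc_auc_score(y_true[idx], prob_a[idx])
                     - roc_auc_score(y_true[idx], prob_b[idx]))
    diffs = np.asarray(diffs)
    ci_low, ci_high = np.quantile(diffs, [0.025, 0.975])
    # Assumption: two-sided p-value as twice the smaller tail share around zero.
    p_value = min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))
    observed = roc_auc_score(y_true, prob_a) - roc_auc_score(y_true, prob_b)
    return float(observed), (float(ci_low), float(ci_high)), float(p_value)


# Synthetic stand-in data (NOT the study cohort).
X, y = make_classification(n_samples=400, n_features=17, n_informative=6,
                           weights=[0.15, 0.85], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3,
                                min_samples_leaf=30, subsample=0.8,
                                max_features="sqrt", random_state=0)
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

oof_gb = cross_val_predict(gb, X, y, cv=cv, method="predict_proba")[:, 1]
oof_lr = cross_val_predict(lr, X, y, cv=cv, method="predict_proba")[:, 1]
delta_auc, (diff_low, diff_high), p_value = paired_bootstrap_auc_diff(y, oof_gb, oof_lr)
```

Resampling the same patients for both models preserves the correlation between their predictions, which is what makes the comparison paired.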
2.4. Explainability Analysis
To explain individual predictions, we employed an interventional approach to computing Shapley-based feature contributions [22]. Shapley values, originating from cooperative game theory, provide a theoretically principled decomposition of a model’s output into additive contributions from each input feature, satisfying properties of efficiency (contributions sum to the prediction), symmetry, and null player consistency [15,16,23]. For each patient and each feature, the contribution was estimated by marginalizing over a reference distribution of 100 randomly sampled background observations: the feature of interest was replaced with background values while all other features were held at their observed values, and the mean change in predicted probability was recorded as that feature’s contribution. Summing contributions across all features approximately reconstructs the difference between the individual’s predicted probability and the population mean. This interventional framing (sometimes termed “do-calculus” SHAP) is preferred over the conditional expectation approach when features are correlated, because it avoids attributing variance to a feature through its statistical associations with other inputs, yielding contributions that are less driven by correlations among features [24].
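The one-feature-at-a-time marginalization described above can be sketched as follows. This is an illustration of the procedure as worded in the text, not the authors’ code; `marginal_contributions` and the toy linear model are hypothetical. For a purely additive model the contributions sum exactly to the gap between the prediction and the background mean, whereas a model with interactions would generally leave a residual of the kind reported later as reconstruction error.

```python
import numpy as np


def marginal_contributions(predict, x, background):
    """Mean change in prediction when one feature is replaced by background values.

    predict    : callable mapping an (n, d) array to (n,) predictions
    x          : (d,) observation to explain
    background : (m, d) reference sample
    """
    x = np.asarray(x, dtype=float)
    base = predict(x[None, :])[0]
    contribs = np.empty(x.size)
    for j in range(x.size):
        perturbed = np.tile(x, (background.shape[0], 1))
        perturbed[:, j] = background[:, j]  # intervene on feature j only
        contribs[j] = base - predict(perturbed).mean()
    return contribs


rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 4))  # synthetic features (NOT study data)
background = data[rng.choice(1000, size=100, replace=False)]  # 100 rows, no replacement
weights = np.array([0.8, -0.5, 0.3, 0.0])  # last feature has no effect


def linear_model(X):
    """Toy additive model used only to check the decomposition."""
    return X @ weights + 0.1


x = data[0]
phi = marginal_contributions(linear_model, x, background)
gap = linear_model(x[None, :])[0] - linear_model(background).mean()
reconstruction_error = abs(phi.sum() - gap)  # ~0 for an additive model
```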
The 100-observation background sample was drawn by simple random sampling without replacement from the 1000-patient cohort, with a fixed random seed for reproducibility. A sensitivity check with alternative background sample sizes (50, 100, and 200) yielded essentially the same global feature ranking, confirming that the Shapley estimates were not sensitive to the specific background size within this range. Because the interventional formulation introduces some approximation relative to the exact tree-based computation, we quantified the gap between the sum of Shapley contributions and the model’s predicted probability on each patient. The average absolute reconstruction error across the cohort was 0.069, which we treat as the expected cost of the interventional formulation relative to the exact TreeSHAP algorithm. The practical consequence for interpretation—namely, that feature-value zero-crossings should be read as transition regions rather than as precise cut-points—is addressed in the Limitations section.
To identify non-linear feature effects that differ from conventional binary thresholds, we examined dependence plots relating each continuous feature’s value to its Shapley contribution across the cohort. Rolling mean trends were overlaid to smooth sample-level noise and to identify inflection points—values at which a feature’s contribution crosses zero or changes direction—that may diverge from the fixed cutoffs used in existing scoring systems.
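A minimal sketch of the rolling-mean trend and zero-crossing search is shown below. The window size of 50 observations is a placeholder (the text does not report one), and the synthetic contribution curve is constructed to cross zero at 0.90 purely for illustration.

```python
import numpy as np


def rolling_mean_zero_crossings(feature, contribution, window=50):
    """Smooth contributions along the sorted feature and locate sign changes."""
    order = np.argsort(feature)
    x_sorted = np.asarray(feature, dtype=float)[order]
    c_sorted = np.asarray(contribution, dtype=float)[order]
    kernel = np.ones(window) / window
    trend = np.convolve(c_sorted, kernel, mode="valid")    # rolling mean contribution
    centers = np.convolve(x_sorted, kernel, mode="valid")  # rolling mean feature value
    change = np.nonzero(np.sign(trend[:-1]) * np.sign(trend[1:]) < 0)[0]
    crossings = 0.5 * (centers[change] + centers[change + 1])
    return centers, trend, crossings


# Synthetic example (NOT study data): contribution rises linearly through zero at 0.90.
rng = np.random.default_rng(0)
feature = rng.uniform(0.5, 1.5, size=1000)
contribution = 0.3 * (feature - 0.90) + rng.normal(0.0, 0.02, size=1000)
centers, trend, crossings = rolling_mean_zero_crossings(feature, contribution)
```

With noisy contributions the smoothed trend can cross zero several times within a narrow band, which is one reason to read such crossings as transition regions rather than exact cut-points.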
Four representative patients were selected to illustrate the clinical relevance of individual-level explanations: (A) a classic stone presentation with high predicted probability, (B) a correctly low-probability non-stone case, (C) a diagnostically uncertain case with conflicting feature contributions, and (D) a stone mimicker with high probability but no confirmed stone on CT.
2.5. Sequential Information Gain Framework
To quantify the marginal diagnostic value of each testing stage, we adapted a Shannon entropy-based information gain framework [20]. For each patient at each stage, diagnostic uncertainty was expressed as binary entropy $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$, where $p$ is the model-predicted probability of urolithiasis given the features available at that stage. Entropy is maximized at 1.0 bit when $p = 0.5$ (complete uncertainty) and approaches 0 as the prediction converges toward 0 or 1. The prior entropy $H_0$, computed from the observed cohort prevalence (85.0%), served as the baseline against which all subsequent entropy reductions were referenced ($H_0 = 0.610$ bits). This prevalence-anchored baseline ensures that marginal gains are interpreted relative to what a clinician would know before any testing, rather than against an uninformative 50% prior.
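As a quick check of the arithmetic, the binary entropy and the prevalence-anchored prior can be computed directly. The function name is our own; the only input taken from the text is the 85.0% prevalence.

```python
import math


def binary_entropy(p):
    """Binary Shannon entropy in bits; defined as 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


prior_entropy = binary_entropy(0.85)  # prior from the cohort prevalence
max_entropy = binary_entropy(0.5)     # complete uncertainty
```

Evaluated at 0.85, this gives a value that rounds to the reported 0.610 bits, and at 0.5 it gives the 1.0-bit maximum.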
Five sequential testing stages were defined to mirror the typical ED workflow: Stage 1 (history and demographics), Stage 2 (adding physical examination), Stage 3 (adding dipstick urinalysis), Stage 4 (adding microscopy), and Stage 5 (adding blood tests). At each stage, a separate gradient boosting model was trained using only the features available up to that point. The marginal information gain of each stage was computed as the reduction in mean population-level entropy relative to the prior stage.
Each stage-specific model was trained with the same hyperparameters as the full 17-feature model, differing only in the feature subset available. Predicted probabilities at each stage were obtained from the held-out fold of the same five-fold stratified cross-validation used throughout the study, and all entropy calculations were performed on these held-out predictions. Because the stage-specific models are trained independently, the stage-to-stage changes in an individual patient’s predicted probability do not necessarily decompose into additive contributions from the newly added features alone; they also incorporate any re-weighting of earlier features induced by the expanded feature set. The interpretive consequences of this property are addressed in the Limitations section.
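The staged procedure can be sketched as below, assuming the features are ordered by stage so that each stage simply exposes more columns. The stage sizes follow the 5/4/3/3/2 split given in Section 2.2; the data are synthetic and the column-to-stage assignment is purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def binary_entropy(p):
    """Element-wise binary entropy in bits, clipped to avoid log(0)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)


# Number of features ADDED at each stage (5 + 4 + 3 + 3 + 2 = 17).
STAGE_SIZES = {"history": 5, "physical exam": 4, "dipstick": 3,
               "microscopy": 3, "blood tests": 2}

# Synthetic stand-in data (NOT the study cohort).
X, y = make_classification(n_samples=500, n_features=17, n_informative=8,
                           weights=[0.15, 0.85], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

prior_entropy = float(binary_entropy(y.mean()))  # prevalence-anchored H0
previous = prior_entropy
stage_entropy, stage_gain = {}, {}
n_available = 0
for stage, size in STAGE_SIZES.items():
    n_available += size
    clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3,
                                     min_samples_leaf=30, subsample=0.8,
                                     max_features="sqrt", random_state=0)
    # A separate model per stage, trained only on the features available so far.
    oof = cross_val_predict(clf, X[:, :n_available], y, cv=cv,
                            method="predict_proba")[:, 1]
    stage_entropy[stage] = float(binary_entropy(oof).mean())
    stage_gain[stage] = previous - stage_entropy[stage]  # reduction vs. prior stage
    previous = stage_entropy[stage]
```

Because each gain is a difference between consecutive stages, the gains telescope: their sum equals the prior entropy minus the mean entropy after the final stage.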
At the patient level, we defined “confident classification” as a predicted probability exceeding 90% (high-probability group) or falling below 20% (low-probability group) at any stage. Patients reaching these thresholds at an earlier stage were considered resolved and excluded from subsequent stages. The proportion of patients who did not reach either threshold after all five stages was taken as an estimate of the population for whom clinical and laboratory evaluation alone could not resolve diagnostic uncertainty.
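The resolution rule can be illustrated with a small sketch. The probabilities below are invented for three hypothetical patients, and `first_resolved_stage` is our own helper name; the thresholds (above 0.90 or below 0.20) are those stated in the text.

```python
import numpy as np


def first_resolved_stage(stage_probs, high=0.90, low=0.20):
    """Return the 1-based stage of first confident classification, or 0 if never.

    stage_probs : (n_patients, n_stages) predicted probabilities per stage
    """
    confident = (stage_probs > high) | (stage_probs < low)
    first = confident.argmax(axis=1) + 1     # index of the first True per row
    first[~confident.any(axis=1)] = 0        # rows with no confident stage
    return first


# Made-up probabilities for three hypothetical patients across five stages.
probs = np.array([
    [0.86, 0.93, 0.95, 0.96, 0.97],  # exceeds 0.90 at stage 2
    [0.70, 0.55, 0.30, 0.15, 0.10],  # falls below 0.20 at stage 4
    [0.80, 0.75, 0.60, 0.65, 0.70],  # never reaches either threshold
])
resolved_at = first_resolved_stage(probs)
unresolved_fraction = float((resolved_at == 0).mean())
```

Taking the first stage at which a threshold is crossed mirrors the rule that resolved patients are not carried forward to later stages.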
2.6. Statistical Analysis
Continuous variables are presented as mean ± standard deviation or median (interquartile range) as appropriate. Categorical variables are presented as counts and percentages. AUC comparisons between the gradient boosting model and the established reference score were performed using DeLong’s test. All analyses were conducted in Python 3.10 using scikit-learn 1.3, NumPy 1.24, and pandas 2.0. A two-sided p-value < 0.05 was considered statistically significant.
4. Discussion
In this study, we applied an explainable machine learning framework to the emergency department diagnosis of urolithiasis and found that its primary value lies not in superior aggregate discrimination but in three complementary contributions: revealing non-linear feature effects that diverge from conventional cutoffs, providing transparent individual-level explanations for clinical predictions [15], and quantifying the marginal information gain of sequential diagnostic testing [25,26].
The finding that the ML model did not demonstrate a clinically meaningful improvement over a well-designed scoring system is itself informative. Several methodological factors may explain this. The dataset of 1000 patients, while adequate for model training, may be insufficient for a gradient boosting classifier to fully exploit the added complexity of 17 continuous features relative to a well-calibrated binary score; simulations suggest that tree-based ML models require substantially larger training sets to consistently outperform simpler rule-based approaches in binary clinical outcomes [27,28]. Additionally, some of the continuous variables included, such as specific gravity and urine WBC, may carry relatively high measurement noise in routine ED settings, attenuating the expected benefit of retaining them on their original scale. These observations suggest that existing scoring systems already capture the predominant predictive signal in this population, and that the comparative advantage of ML in this context lies in interpretability and individualization rather than aggregate discrimination [27].
Two additional comparisons support this interpretation. First, a logistic regression model trained on the same 17 continuous features achieved an out-of-fold AUC of 0.760 (95% CI, 0.713–0.803), and the paired bootstrap comparison against gradient boosting yielded ΔAUC = +0.012 (95% CI, −0.027 to +0.050; p = 0.54), which is not statistically significant. That a transparent linear baseline matches the gradient boosting model on aggregate discrimination reinforces the view that the incremental value of this framework in the present cohort lies elsewhere. Second, the 9-feature continuous model attained an AUC of 0.750, modestly below the 17-feature model’s 0.771 but well within the bootstrap confidence intervals of either model. The eight additional variables in the 17-feature set offered only marginal discriminative gain on their own, consistent with the observation that their signal-to-noise ratio in routine ED practice is modest relative to the core features. We present the 17-feature model because its richer feature set supports the Shapley-based non-linear and individual-level analyses that follow, not because it outperforms simpler alternatives on aggregate metrics.
The non-linear effects identified through Shapley analysis have several practical implications. The creatinine zero-crossing near 0.90 mg/dL, rather than 0.92, suggests that a modestly lower threshold might capture additional predictive information without meaningfully increasing false positives. The peak predictive window for pain duration at 2–5 h, rather than the blanket < 8 h cutoff, aligns with the typical timeline of ureteral peristaltic colic and may reflect the clinical reality that very short presentations (<1 h) are less discriminating because stone and non-stone etiologies both present acutely, while very long presentations (>16 h) may indicate a complication or alternative diagnosis [29]. Most notably, the strong negative contribution of CRP at levels above 3 mg/dL, which reduced stone prevalence from 86.1% to 45.5%, highlights a clinically recognizable but formally unaccounted-for signal. In our cohort, among the 11 non-stone patients with CRP above 3 mg/dL, the most common alternative findings included urinary tract infection or cystitis (n = 3), pelvic inflammatory disease (n = 1), pyelonephritis (n = 1), and benign prostatic hyperplasia with suspected prostatitis or cystitis (n = 2). This distribution is consistent with the interpretation that markedly elevated CRP should shift clinical suspicion toward infectious or inflammatory mimickers rather than uncomplicated stone disease, echoing prior observations that infectious etiologies are among the most common alternative diagnoses in patients initially suspected of urolithiasis [30].
The individual patient explanations (Figure 4) illustrate a distinct mode of clinical reasoning that numerical scores do not support. When a patient falls into an intermediate risk stratum on a conventional scoring system, the clinician knows the probability range but little else. The Shapley waterfall plot, by contrast, reveals whether the intermediate probability results from uniformly mild evidence or from strong conflicting signals, a distinction with direct implications for next steps. Case C exemplifies this: the opposing contributions of elevated creatinine and elevated CRP make explicit the clinical tension between stone-suggestive and infection-suggestive findings, flagging the patient for definitive imaging rather than additional bedside tests.
The sequential information gain analysis provides a quantitative basis for a question that clinicians navigate intuitively but rarely formalize: how much does each additional test contribute for this patient? Our finding that dipstick urinalysis provided the largest marginal gain among non-history stages, while microscopy added comparatively little incremental value, is consistent with the general clinical intuition that dipstick hematuria is one of the most informative bedside findings in suspected urolithiasis [31]. The minimal incremental gain from microscopy over dipstick (ΔAUC = 0.008, entropy reduction 1.3%) raises a clinically meaningful question for resource-constrained ED settings: whether routine urine microscopy adds sufficient diagnostic information to justify its time and cost in all patients with suspected urolithiasis, or whether it could be safely reserved for those with equivocal dipstick findings or intermediate predicted probability. This preliminary observation merits prospective investigation.
The asymmetric entropy trajectories between stone and non-stone patients deserve emphasis. In a population with 85% stone prevalence, the prior already strongly favors a stone, making confirmation relatively straightforward while exclusion requires overcoming this prior with consistently negative evidence. This structural asymmetry has direct implications for the 20% low-probability threshold: because the pre-test probability is so high, even a model prediction below 20% still carries a non-trivial stone rate (observed stone prevalence among patients with an out-of-fold predicted probability < 20%: 20.0%, n = 10), and cannot be used as a clinical rule-out. The 20% boundary was chosen to represent a theoretically meaningful probability reduction from the 85% prior, not to define a clinically safe discharge threshold. Projected predictive values across a range of alternative baseline prevalences are presented in Supplementary Table S3; external application of this framework requires recalibration to the local prevalence of the target population. We therefore present this framework not as a triage rule but as a reasoning aid: a way of making explicit which features are driving a prediction, how much uncertainty remains after each testing stage, and where the points of diagnostic tension lie for a given patient. Whether the 90% and 20% thresholds or any other decision boundaries are clinically actionable is a separate question that depends on the local prevalence, the consequences of misclassification, and prospective validation, none of which this study can answer.
To illustrate how the framework’s behavior would change in populations with more typical ED stone prevalence, we projected the positive and negative predictive values of the 0.90 and 0.20 thresholds across a range of baseline prevalences using the observed sensitivities and specificities in our held-out predictions. At the cohort’s observed prevalence of 85%, the positive predictive value of a predicted probability ≥ 0.90 was 94.1% and the negative predictive value of a predicted probability ≤ 0.20 was 86.7%. At a hypothetical 40% prevalence (closer to an unselected ED flank-pain population), the positive predictive value would fall to 65.3% while the negative predictive value would rise to 98.2%; at 30% prevalence, the corresponding values would be 54.8% and 98.8% (Supplementary Table S3). This asymmetry is informative: the difficulty of exclusion observed in our cohort is in large part a structural artifact of the extreme prior probability, and in lower-prevalence populations the same threshold structure would be expected to behave very differently. We emphasize that these projections are analytical extrapolations rather than direct evidence; prospective validation in an unselected population is required before any threshold can be proposed for clinical use, and the small number of patients in the ≤0.20 bin of our cohort limits the precision of the underlying specificity estimate.
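The projection arithmetic can be sketched in odds form. If sensitivity and specificity are assumed not to change with prevalence, a predictive value observed at one prevalence implies a fixed likelihood ratio, which can then be combined with a different prior. The function name is hypothetical, and because the inputs are the rounded percentages quoted above, the outputs agree with the reported figures only to within rounding.

```python
def project_predictive_value(pv_observed, prior_observed, prior_new):
    """Re-express a predictive value at a different prior probability.

    pv_observed    : predictive value for the predicted class at the observed prior
    prior_observed : prior probability of that class in the observed cohort
    prior_new      : prior probability of that class in the target population
    Assumes sensitivity and specificity do not change with prevalence.
    """
    posterior_odds = pv_observed / (1.0 - pv_observed)
    prior_odds = prior_observed / (1.0 - prior_observed)
    likelihood_ratio = posterior_odds / prior_odds  # implied by the observed values
    new_odds = likelihood_ratio * prior_new / (1.0 - prior_new)
    return new_odds / (1.0 + new_odds)


# PPV: the predicted class is "stone", so the prior is the stone prevalence.
ppv_40 = project_predictive_value(0.941, 0.85, 0.40)
ppv_30 = project_predictive_value(0.941, 0.85, 0.30)
# NPV: the predicted class is "no stone", so the prior is 1 - prevalence.
npv_40 = project_predictive_value(0.867, 1 - 0.85, 1 - 0.40)
npv_30 = project_predictive_value(0.867, 1 - 0.85, 1 - 0.30)
```

Under these assumptions the projected values come out at roughly 0.65 and 0.98 for 40% prevalence and roughly 0.55 and 0.99 for 30% prevalence, in line with the figures reported in the text.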
This study has several limitations that should be acknowledged. First, the gradient boosting model, despite achieving a statistically significantly higher AUC by DeLong’s test (0.771 vs. 0.723, p = 0.001), did not demonstrate a clinically meaningful improvement in aggregate discrimination over the reference scoring system, suggesting that the added complexity of ML is unlikely to be justified on discrimination grounds alone in this setting. Second, this is a single-center retrospective analysis, and the results require external validation before any clinical application can be considered. The high stone prevalence (85.0%) reflects the selection of patients who underwent non-contrast CT at a tertiary referral center; this is substantially higher than estimates from unselected ED flank pain cohorts, where stone prevalence typically ranges from 30 to 50%. At lower baseline prevalence, the positive predictive value of the high-probability threshold would diminish markedly, and the baseline entropy would be closer to its theoretical maximum of 1.0 bit, offering greater potential for information gain. The prior entropy of 0.610 bits observed in this study therefore reflects a high-prevalence population in which meaningful entropy reduction is already structurally limited. Furthermore, nine patients with missing pain scale data were excluded via complete-case analysis; a comparison of these patients against the included cohort (Supplementary Table S2) showed that they differed systematically on several characteristics, most notably having a lower observed stone prevalence (44.4% vs. 85.0%, p = 0.006) and an older age distribution, a pattern consistent with atypical presentations in which pain was not the dominant chief complaint. Given the small proportion affected (0.9%), the impact on the trained model is likely modest, but we note this difference explicitly rather than treating the missingness as uninformative. Third, the Shapley values were computed using an interventional marginalization approach, introducing an average reconstruction error of 0.069 between the sum of Shapley contributions and the model’s predicted probability. This has a practical consequence for regions in which a feature’s Shapley contribution is small in absolute magnitude, which is precisely the case near the zero-crossings we identified. For borderline values such as creatinine around 0.90 mg/dL or pain duration in the 2–5 h window, the precise location of the inflection point should therefore be read as an indicative transition region rather than a precise cut-point, and formal threshold determination for clinical use would require exact TreeSHAP and bootstrap confidence intervals on the crossing location. Fourth, we did not account for potential temporal trends in practice patterns or changes in CT utilization over the study period. Fifth, the stage-specific models in the sequential framework were trained independently at each stage, which means that adding a new testing result at a later stage can implicitly re-weight the contributions of earlier features. The marginal information gain we attribute to each stage therefore reflects the total entropy reduction associated with that stage, which includes both the direct contribution of the newly added features and any re-weighting of earlier features that those additions induce. A fully decomposed attribution, separating the independent contribution of each stage from cross-stage interaction effects, would require a different formulation, such as conditional Shapley values applied to the full model, and should be pursued in future work. Finally, as Case D illustrates, any prediction model is bounded by the information content of its inputs; when all measurable features resemble stone, no statistical method can distinguish a true stone from a perfect mimicker without imaging.