Article
Peer-Review Record

Comparing Methods for Uncertainty Estimation of Paraganglioma Growth Predictions

J. Otorhinolaryngol. Hear. Balance Med. 2026, 7(1), 3; https://doi.org/10.3390/ohbm7010003
by Evi M. C. Sijben 1,2, Vanessa Volz 2, Tanja Alderliesten 1, Peter A. N. Bosman 2, Berit M. Verbist 3, Erik F. Hensen 4 and Jeroen C. Jansen 4,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 20 October 2025 / Revised: 15 December 2025 / Accepted: 28 December 2025 / Published: 6 January 2026
(This article belongs to the Section Head and Neck Surgery)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors
  1. Summary of the Manuscript

This study compares two different ways of estimating uncertainty when predicting how head and neck paragangliomas grow over time. The authors evaluate:

A historical method, based on past prediction errors.

A Bayesian method, which incorporates uncertainty from the growth model and from AI-based tumor segmentation.

The analysis includes 208 patients, 311 tumors, and 1501 measurements, leading to over 2500 individual predictions. Three Gompertz-like models are used to describe tumor growth.

Overall, the historical method provides wide but well-calibrated confidence intervals, while the Bayesian approach gives tighter intervals and, importantly, better distinguishes between tumors that will grow and those that will remain stable.

  2. Strengths of the Study

A solid and well-assembled dataset

The authors work with a substantial dataset, with many tumors having multiple follow-up scans. This gives credibility to the modeling work and strengthens the conclusions.

Clear clinical motivation

Paragangliomas often grow slowly, and predicting which tumors will actually progress is extremely important for scheduling follow-up MRI scans and avoiding unnecessary treatments. The paper addresses this real clinical need very well.

Thoughtful methodological design

The comparison between historical and Bayesian uncertainty estimation is well conceived.

The use of a Monte Carlo Bayesian approach and the attempt to integrate segmentation uncertainty is technically sound and clearly explained.

Effective figures and examples

The clinical examples (especially Figures 5 and 6) are particularly helpful. They show how prediction-based risk categories could influence practical decisions about follow-up timing.

Good potential for future clinical application

The idea of using uncertainty to classify tumors into “likely to grow” versus “likely to stay stable” is appealing and easier to integrate into clinical workflows than a raw confidence interval.

  3. Main Issues and Suggestions

  3.1 Need for external validation

All results come from a single institution, using similar MRI scanners and protocols.

To judge whether these findings generalize, we need to see how the model performs on data from other centers.

  3.2 Auto-segmentation uncertainty ≠ clinical uncertainty

The uncertainty estimated from the nnU-Net ensemble reflects model variability, but not the variability between radiologists.

Human inter-observer variability is often larger and more clinically meaningful.

Comparing model-based uncertainty with radiologist-based variability would strengthen the study considerably.

  3.3 Growth models are restrictive

All three models enforce smooth, S-shaped growth curves. In real life, paragangliomas sometimes remain stable for a long period and then accelerate, grow linearly, or, very rarely, shrink.

These patterns cannot be captured by the current models. Even if rare, acknowledging and testing for such deviations would improve the robustness of the conclusions.

  3.4 The 20% threshold for “growth” needs updating

The cutoff comes from a study more than 20 years old. Given modern volumetric techniques, a more contemporary justification—or at least a sensitivity analysis—would help.

  3.5 Many predictions remain “unclassified”

Although the Bayesian method performs best, a large number of predictions fall into the “uncertain” category (especially when only one or two measurements are available).

This limits immediate clinical usefulness and should be discussed more explicitly.

  3.6 Missing clinical predictors

Important clinical factors that influence growth—such as SDHB/SDHD mutation status, tumor location, and age—are not included in the models.

Adding such variables could substantially increase prediction accuracy.

  3.7 No comparison with radiologist performance

The paper argues that volumetry may detect changes earlier than diameter-based assessment, but this is not formally tested. A small comparison study, even retrospective, would underline the practical value of the method.

  4. Section-by-Section Suggestions

Introduction

Add more context on genetic and anatomical factors influencing paraganglioma growth.

Explain the choice of Gompertz-like curves in more intuitive, biological terms.

Methods

Include MRI acquisition parameters.

Better justify the Bayesian prior sampling strategy.

Clarify the rationale for fixing the 20% growth threshold.

Results

Provide confidence intervals for percentages in Tables 1 and 2.

Indicate how many tumors had only one or two data points.

Enhance readability of heatmaps (consistent scales, clearer labels).

 

Discussion

Compare findings to other tumor-growth modeling frameworks.

Expand the discussion on MRI geometric distortion and segmentation error.

Offer a clearer vision of how these methods could be tested in a clinical setting.

Conclusions

Explicitly state that the model is promising but still requires external validation before clinical adoption.

  5. Overall Recommendation

Major Revision

The manuscript is strong, original, and addresses a highly relevant clinical question.

However, several methodological clarifications and additional analyses are needed to make the work fully convincing and ready for publication.

 

  6. Short Reviewer Summary

This is an interesting and well-constructed study. The authors tackle a meaningful clinical challenge—predicting which paragangliomas will grow—and offer a thoughtful comparison between two uncertainty estimation strategies. The Bayesian approach, especially when combined with tumor-specific segmentation uncertainty, appears promising.

However, the paper would benefit from deeper discussion of limitations, especially regarding model generalization, segmentation variability, and the rigid nature of the growth models. With revisions, this work has the potential to become a valuable contribution to radiologic AI research.

Author Response

Please see the attachment. 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

I am grateful to the editor for giving me the opportunity to evaluate this interesting work. These are my comments that I have divided into sections that correspond to the manuscript structure:

A) Abstract

1. The abstract contains some unclear phrasing. For instance, the statement that “the tumor volume fell only 78% of the time within its estimated 95% confidence interval” could be misunderstood without context. Readers might not immediately realize that this implies the Bayesian intervals are miscalibrated (in other words, the interval does not cover the true value 95% of the time, as would be expected).

Suggestions for improvement: Revise the wording for clarity and accuracy. When describing the Bayesian 95% credible interval, which contained the true volume only 78% of the time, use precise terminology and explicitly note that this indicates underestimation of uncertainty (i.e., the Bayesian 95% credible intervals were too narrow). This will ensure the key result about calibration is immediately clear.

2. The abstract is somewhat lengthy and blends background, methods, results, and conclusions. While it is informative, it might overwhelm readers with multiple percentages and numeric results. For instance, listing numerous statistics (94%, 78%, 86%, etc.) in a short span can be hard to digest.

Suggestions for improvement: Consider streamlining the abstract to focus on the most critical findings. Present the results in a balanced way by emphasising the trade-off observed between the historical method, which gives very wide but well-calibrated intervals, and the Bayesian method, which gives narrower but under-calibrated intervals. It may also help readers if the authors explicitly state why the Bayesian method is deemed most useful despite its low 78% coverage (please highlight that the Bayesian method’s superior ability to distinguish growing from stable tumors is the primary rationale). Ensuring the abstract has a clear takeaway will make it more effective in capturing readers’ interest and attention.

B) Introduction

1. Clarity and Grammar: The introduction contains a few minor grammatical issues and could be phrased more clearly in places. For example, the sentence “This study focuses on both historical and Bayesian methods to estimating uncertainty” is ungrammatical and should be corrected to “methods for estimating uncertainty”.

Suggestions for improvement: Proofread and revise sentences for clarity and correct grammar. In the example above, use “methods for estimating uncertainty”.

2. Context and Literature: While the introduction lays out the clinical problem and the need for growth prediction, it could better acknowledge relevant prior work on uncertainty quantification in tumor growth modeling. The authors reference their own prior study and a few clinical studies, but they do not mention other existing approaches to predictive uncertainty. For instance, recent studies in oncology have explored Bayesian methods for tumor growth and calibration of prediction models. Notably, general frameworks for Bayesian tumor growth modeling and uncertainty have been described (for example, in mechanistic modeling or population modeling contexts), which the current manuscript does not cite.


Suggestions for improvement: Strengthen the literature review by including a brief discussion of related work on uncertainty in tumor growth predictions. For example, mention that Bayesian calibration techniques have been applied to tumor growth models in other settings to improve predictive accuracy (2). Citing a recent review or study on head and neck paraganglioma management could also provide clinical context (for instance, noting the typical slow growth rates and the rationale for wait-and-scan management (1)). This will position the study within the broader research landscape and highlight its novelty.

3. Objectives and Scope: The introduction lists multiple aims: 1) comparing historical versus Bayesian uncertainty estimation, 2) incorporating auto-segmentation uncertainty (tumour-specific versus general approaches), 3) using multiple growth models, and 4) demonstrating the use of uncertainty for risk stratification. While these aims are all mentioned, the narrative jumps quickly from one to the next. It may not be immediately clear to the reader how these pieces fit together or which is the primary research question. For instance, are the authors mainly trying to find the best method for uncertainty quantification, to show the value of adding segmentation uncertainty, or to advocate for multi-model ensembles? The text could better synthesize these points.

Suggestions for improvement: Refine the final part of the introduction to clearly state the study’s primary research question and hypothesis. For example: “In this study, we evaluate two approaches to uncertainty quantification (historical vs. Bayesian) for tumor growth models. We hypothesize that a Bayesian approach, especially when augmented with auto-segmentation uncertainty and multi-model ensembling, will provide more useful, sharper yet acceptable uncertainty estimates for distinguishing tumor growth behavior.” Present the various components (auto-segmentation uncertainty, multiple models) as sub-questions or strategies within that overall research goal. This restructuring will give the reader a clearer overview of the paper’s main theme and the scientific questions the authors set out to answer.

C) Methodology

1. Historical Method (Assumptions and Justification): The historical uncertainty method assumes that past prediction errors (residuals) are representative of future errors. While this approach is described (i.e., using percentiles of normalized errors by similar scenario), the manuscript does not discuss its potential limitations. For instance, pooling residuals from different tumors and time horizons assumes a stationary error distribution and ignores that errors might systematically depend on tumor size or growth rate despite volume normalization. It is also unclear how many comparable past errors are available for certain scenarios, such as long-term forecasts with few prior measurements. If this is sparse, the percentile-based interval may be unstable.


Suggestions for improvement: Add a brief rationale for why this historical error approach is expected to work, and acknowledge its limitations. For example, explain that it attempts to guarantee calibration by matching the empirical error distribution, but note that this may lead to overly conservative intervals (as indeed observed). If possible, include a reference to similar residual-based prediction interval methods in the literature to justify its use. Additionally, consider whether stratifying residuals by additional factors is needed (e.g., time between scans or absolute tumor size) to improve the representativeness. Clarifying these points will help readers trust that the historical method is implemented thoughtfully rather than as a black box.
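
To make the sparsity concern concrete, the following is a minimal sketch of a residual-based percentile interval, not the authors' implementation. The function names, the relative-error normalization, and the `min_samples` guard are assumptions introduced for illustration only.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile of pre-sorted values, q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def historical_interval(predicted_volume, past_relative_errors,
                        level=0.95, min_samples=20):
    """Empirical prediction interval from past normalized errors.

    past_relative_errors holds (observed - predicted) / predicted for
    earlier predictions made in a comparable scenario. This normalization
    and the min_samples default are assumptions, not the manuscript's.
    """
    if len(past_relative_errors) < min_samples:
        # Too few comparable residuals: the percentiles would be unstable.
        raise ValueError("not enough past errors for a stable interval")
    errors = sorted(past_relative_errors)
    low_err = percentile(errors, (1 - level) / 2)
    high_err = percentile(errors, (1 + level) / 2)
    lower = max(predicted_volume * (1 + low_err), 0.0)  # volume cannot be negative
    upper = predicted_volume * (1 + high_err)
    return lower, upper
```

With such a guard in place, the authors could report for which scenarios (for example, long horizons with few prior measurements) the pool of comparable residuals falls below the chosen minimum.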

2. Implementation Details of the Bayesian Method: The Bayesian uncertainty estimation is said to attribute uncertainty to the model’s coefficients and to use Monte Carlo simulation to produce a distribution for the growth curve. However, important implementation details are left to Appendix A.3 and could be summarized in the main text. For instance, the authors do not explicitly state how the coefficient uncertainty is quantified. Are priors assumed for the model parameters, and is full Bayesian inference performed, or is this a bootstrap of the fit? The term “Monte Carlo simulation” suggests sampling from an estimated posterior or from a parameter covariance, but this is not clear from the text. Moreover, the text refers to transforming a single growth prediction into a probability distribution, which is conceptually ambiguous.

Suggestions for improvement: Provide a concise description of how the Bayesian prediction intervals are obtained. For example: “We estimated the posterior distribution of the growth model parameters given the data (using [method X, e.g. MCMC or assumption of asymptotic normality of parameters]), then drew many samples of parameters to simulate future tumor growth, thereby constructing a predictive distribution for future volume.” This clarifies how the Bayesian prediction intervals are constructed. Additionally, please explicitly state that the intervals are Bayesian credible intervals, to distinguish them from frequentist confidence intervals. This is important for terminological accuracy and consistency, since these intervals represent predictive credibility under the model’s assumptions. Such clarification will make the methodology more transparent and reproducible.
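
The kind of description requested here could be accompanied by something as short as the sketch below. It is a hedged illustration only: it assumes a standard three-parameter Gompertz curve and independent normal approximations for the plateau and rate parameters, neither of which is confirmed to be what the manuscript uses, and all names and defaults are hypothetical.

```python
import math
import random


def gompertz(t, v0, k, a):
    """Standard Gompertz curve: volume at time t, from v0 towards plateau k."""
    return k * math.exp(math.log(v0 / k) * math.exp(-a * t))


def credible_interval(samples, level=0.95):
    """Equal-tailed interval from Monte Carlo samples (linear interpolation)."""
    s = sorted(samples)

    def pct(q):
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] * (1 - (pos - lo)) + s[hi] * (pos - lo)

    return pct((1 - level) / 2), pct((1 + level) / 2)


def predictive_interval(t, v0, k_mean, k_sd, a_mean, a_sd,
                        n_draws=5000, level=0.95, seed=0):
    """Draw parameters, simulate the curve, and summarize the predictive draws.

    Assumption: parameter uncertainty is approximated by independent normal
    distributions; a full posterior (e.g. from MCMC) would replace the two
    rng.gauss calls.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        k = max(rng.gauss(k_mean, k_sd), v0 * 1.001)  # keep plateau above current volume
        a = max(rng.gauss(a_mean, a_sd), 0.0)         # this sketch does not model shrinkage
        draws.append(gompertz(t, v0, k, a))
    return credible_interval(draws, level)
```

Stating which of these ingredients (parameter distribution, truncation rules, number of draws) the authors actually used would resolve the ambiguity noted above.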

3. Auto-Segmentation Uncertainty Integration: The approach to incorporating segmentation uncertainty is novel but needs more justification and possibly refinement. The authors treated the segmentation output as a normally distributed volume with a certain standard deviation (from an ensemble of five models). Two methods were used: tumor-specific (SD calculated per tumor from the five models’ outputs) and general (a global average SD as a percentage of volume). One concern is whether assuming a Gaussian distribution for volume error is appropriate, since volume errors might follow a skewed distribution or scale with tumor size in a non-linear fashion. The general method, which uses an average percentage error, is a rather crude approximation: it ignores the possibility that smaller tumors have relatively larger fractional segmentation errors than large tumors (or vice versa).

Suggestions for improvement: Provide the rationale for choosing a Gaussian distribution to model volume uncertainty, or consider alternative distributions (for instance, a log-normal distribution if volume errors scale multiplicatively). If possible, corroborate this by citing studies that treat segmentation uncertainties statistically. For example, Eaton-Rosen and colleagues showed that leveraging structure-wise segmentation uncertainty can improve the confidence intervals of volume estimates [4]. Additionally, the manuscript should explain why a fixed percentage error was chosen for the general method: is there evidence that relative segmentation error is approximately constant across tumor volumes? If not, the general method could be refined by stratifying the analysis by tumor size or by using a Bayesian hierarchical model to borrow strength across tumors. Overall, a clearer discussion of these uncertainty models will help readers understand the efforts made to address imperfections in segmentation.
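
The practical difference between the two distributional assumptions can be illustrated with a small sketch. The ensemble volumes below are made-up placeholder values, and `sample_volume` is a hypothetical helper, not part of the authors' pipeline.

```python
import math
import random
import statistics


def sample_volume(ensemble_volumes, n_draws=1000, dist="normal", seed=0):
    """Draw plausible tumor volumes from the spread of an ensemble.

    dist="normal" fits mean and SD on the raw volumes (the assumption the
    manuscript is described as using); dist="lognormal" fits them on the
    log scale, which keeps draws positive and makes errors multiplicative.
    """
    rng = random.Random(seed)
    if dist == "normal":
        mu = statistics.mean(ensemble_volumes)
        sd = statistics.stdev(ensemble_volumes)
        return [rng.gauss(mu, sd) for _ in range(n_draws)]
    if dist == "lognormal":
        logs = [math.log(v) for v in ensemble_volumes]
        mu = statistics.mean(logs)
        sd = statistics.stdev(logs)
        return [rng.lognormvariate(mu, sd) for _ in range(n_draws)]
    raise ValueError("dist must be 'normal' or 'lognormal'")


# Placeholder volumes (cm^3) for a small tumor on which five models disagree.
ensemble = [0.2, 0.6, 1.0, 1.6, 2.4]
normal_draws = sample_volume(ensemble, dist="normal")
lognormal_draws = sample_volume(ensemble, dist="lognormal")
```

For small tumors with large relative disagreement, the Gaussian draws can include negative volumes, whereas the log-normal draws cannot; whether this matters in the authors' data is an empirical question they could answer briefly.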

4. Combining Multiple Growth Models: The methodology introduces a custom rule to combine the three growth models’ predictions for risk classification. While the approach (using the minimum or maximum probability when models agree on high or low risk, and averaging otherwise) is plausible, it comes across as ad hoc. The thresholds of 20% and 80% used in this rule were not explicitly justified. Are these tied to the risk cutoffs of 5% and 90%? If so, the connection is not clear. Moreover, by taking the lowest chance when all models predict high growth risk (>80%), the method is intentionally conservative, but it could also under-predict risk if one model is an outlier.

Suggestions for improvement: Provide justification or evidence for this combination rule. For example, was it decided based on preliminary data or chosen to maximize some performance metric? If available, report how this combined approach compares to a simple average in terms of calibration or accuracy. In addition, consider referencing ensemble uncertainty techniques, since combining models to account for model uncertainty is conceptually akin to ensemble methods, which are known to improve reliability. The authors could also cite research on deep ensembles showing that averaging predictions yields better-calibrated uncertainties than single models. A more principled alternative might be Bayesian model averaging (BMA), which would weight each growth model’s prediction by its posterior probability.  Thus, discussing this as an option would strengthen the methodological rigor. At minimum, clarifying the rationale for this combination rule will reassure readers and the editor that it is grounded in sound logic rather than arbitrary choices.
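
For reference, the rule as this reviewer understands it can be written in a few lines. This is a reading of the description, not the authors' code: the use of the maximum in the all-low case and the strictness of the inequalities are assumptions, and the function names are hypothetical.

```python
def combine_growth_probabilities(probs, low=0.20, high=0.80):
    """Combine per-model probabilities of >20% growth into one value.

    - all models above `high`: take the minimum (conservative on high risk)
    - all models below `low`:  take the maximum (conservative on low risk;
      assumed here by symmetry)
    - otherwise:               take the plain average
    """
    if not probs:
        raise ValueError("need at least one model probability")
    if all(p > high for p in probs):
        return min(probs)
    if all(p < low for p in probs):
        return max(probs)
    return sum(probs) / len(probs)


def classify(prob, low_risk=0.05, high_risk=0.90):
    """Map a combined probability to a risk category using the 5%/90% cutoffs."""
    if prob < low_risk:
        return "low"
    if prob > high_risk:
        return "high"
    return "unclassified"
```

Written this way, it becomes easy to swap the first function for a simple average or a weighted (BMA-style) average and compare the resulting classification tables, which is the comparison requested above.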

5. Evaluation Metrics (Sharpness vs. Calibration): The methodology rightly emphasizes evaluating uncertainty via sharpness and calibration. However, it is unclear how these two classes of evaluation metrics were balanced. The text mentions that some previous research proposes offering multiple solutions with different trade-offs, which the authors dismiss as impractical. This is the central philosophical choice: providing a single uncertainty estimate per prediction. It would be helpful to readers if the authors explained how they arrived at an acceptable balance between calibration and sharpness. For example, did they target approximately 80% empirical coverage for a 95% interval as a compromise, or did they strictly aim for the nominal 95% coverage? Currently, the Bayesian method ended up under-calibrated (78% coverage for the 95% Bayesian credible interval), thus favoring sharpness, whereas the historical method was close to nominal coverage (94%) but produced very wide intervals.

Suggestions for improvement: Elaborate on the intended balance between calibration and sharpness. If the Bayesian intervals were too narrow, state whether any post-hoc calibration procedure was attempted. There are simple post-hoc calibration methods for predictive intervals (for instance, isotonic regression on predicted quantiles) that can bring empirical coverage close to the nominal level on held-out data. For instance, Kuleshov et al. describe a straightforward procedure to adjust model-uncertainty outputs towards nominal coverage. Incorporating such a calibration step, or at least discussing it, would strengthen the methodology. It would also demonstrate the authors’ awareness that improving the Bayesian method’s calibration is feasible without fully reverting to overly wide historical intervals. In summary, clarifying this section will convey to readers that the authors have carefully considered how to evaluate and tune the uncertainty estimates, rather than simply computing metrics without proper justification.
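
As one simple possibility, the sketch below widens intervals by a single factor learned on a held-out calibration set. Note that this is a conformal-style width scaling and not the isotonic-regression procedure of Kuleshov et al.; it is shown only because it fits in a few lines. All names are hypothetical, and the approach assumes the held-out cases are exchangeable with future ones.

```python
import math


def width_scale_factor(cal_intervals, cal_observed, level=0.95):
    """Factor by which interval half-widths must grow to reach `level`.

    For each held-out case, the score is the distance of the observed value
    from the interval center, in units of the interval half-width. The
    factor is the ceil((n + 1) * level)-th smallest score, the usual
    split-conformal finite-sample correction.
    """
    scores = []
    for (lo, hi), y in zip(cal_intervals, cal_observed):
        center = (lo + hi) / 2
        half = (hi - lo) / 2
        if half <= 0:
            raise ValueError("intervals must have positive width")
        scores.append(abs(y - center) / half)
    scores.sort()
    n = len(scores)
    rank = math.ceil((n + 1) * level)
    if rank > n:
        raise ValueError("too few calibration cases for this level")
    return scores[rank - 1]


def rescale(interval, factor):
    """Widen (or narrow) an interval symmetrically around its center."""
    lo, hi = interval
    center = (lo + hi) / 2
    half = (hi - lo) / 2
    return center - factor * half, center + factor * half
```

A factor well above 1 on held-out data would quantify how much too narrow the Bayesian intervals are, and reporting the width of the rescaled intervals next to the historical ones would make the sharpness comparison fairer.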

D) Results

1. Reporting of Calibration and Sharpness: The results section presents figures for interval width (sharpness, Fig. 3) and coverage (calibration, Fig. 4), but the description of the information presented by the figures in the manuscript text could be made more lucid and accurate. For instance, readers would benefit from knowing the overall calibration performance in plain numbers. The abstract mentioned the Bayesian 95% credible intervals contained true values 78% of the time compared to 94% for the historical method, but in the main text of the Results section, this information was only alluded to via Figure 4 (“inverse relation to width” and darker colors indicating lower fractions). Not stating the numeric coverage can downplay how under-calibrated the Bayesian method was.

Suggestions for improvement: The authors should explicitly state the calibration outcomes in the text. For example: “Across all predictions, the historical method’s 95% intervals contained the true volume approximately 94% of the time, which is close to the nominal level, whereas the Bayesian method’s 95% credible intervals contained the true volume only approximately 78% of the time, indicating under-calibration.” This numerical summary would make the trade-off concrete. Additionally, the authors may consider presenting numerical values for the overall calibration error and the median interval width for each method. A small summary table or a calibration plot (for instance, observed versus expected coverage) could effectively complement Figures 3 and 4. This would help the reader quickly grasp the magnitude of the differences between methods.
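
The two summary numbers suggested above are inexpensive to compute. A minimal sketch, with a hypothetical function name and toy inputs that are not taken from the manuscript:

```python
from statistics import median


def coverage_and_width(intervals, observed):
    """Empirical coverage and median width of a set of prediction intervals.

    intervals: list of (lower, upper) bounds, one per prediction
    observed:  list of the corresponding measured volumes
    """
    if len(intervals) != len(observed) or not observed:
        raise ValueError("need one observed value per interval")
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, observed))
    widths = [hi - lo for lo, hi in intervals]
    return hits / len(observed), median(widths)


# Toy example: two of four observations fall inside their interval.
cov, width = coverage_and_width(
    [(0, 10), (5, 6), (2, 4), (1, 3)],
    [5, 7, 3, 0],
)
```

Reporting this pair per method, and ideally per number of prior measurements, would give the summary table requested above.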

2. Statistical Significance of Differences: The authors indicate that no relevant differences were observed between the three growth models in terms of uncertainty width. However, the manuscript does not indicate whether any statistical tests were performed to compare methods or models. For example, when comparing the Bayesian tumour-specific and general uncertainty, there were differences in low-risk classification (8.2% vs 15.3% growth in the low-risk category). Are these differences statistically significant, or merely observed? Similarly, the statement of “no relevant differences” between growth models implies that a comparison was made, but the authors should explain on what basis this statement was made (i.e., visual inspection, or a test of interval widths or prediction errors).
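
One simple option for the low-risk comparison is a two-proportion z-test, sketched below under the normal approximation. The function name is hypothetical, and the group sizes behind 8.2% and 15.3% are not given in this report, so any counts passed in would have to come from the authors' tables. The test also treats predictions as independent, which is only approximately true when several predictions come from the same tumor.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value) using the pooled standard error and the normal
    approximation; small groups would call for an exact test instead.
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Because the same tumors contribute to both methods, a paired analysis (or a bootstrap over tumors) may be more appropriate; the authors are best placed to choose.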

3. Handling of Edge Cases (Single-measurement Predictions): The results highlight that predictions based on only a single tumor volume measurement yielded extremely wide intervals (often larger than the volume itself) regardless of method. In fact, Table 2 shows that the majority of predictions, especially those unclassified in risk, come from cases with 1 or 2 measurements. This is an important finding; essentially, with very limited data, neither method can produce a useful forecast. However, the manuscript stops short of discussing the implication that, early in follow-up, the predictions are too uncertain to be considered reliable.

Suggestions for improvement: The authors should emphasize and discuss this point in the Results or Discussion. For example, note in Results: As expected, using only one prior measurement yields very large uncertainty intervals (often >100% of the predicted volume), making those forecasts of limited practical value. This can transition into a recommendation that at least two data points are needed to meaningfully predict growth, an insight that could be valuable for clinicians designing follow-up schedules. In the Discussion, the authors might suggest that alternative strategies are needed for the first imaging time-point, perhaps using population priors or waiting for a second measurement before drawing conclusions. Acknowledging this limitation in the results will show that the authors recognize when their predictive approach is or is not applicable or useful, which enhances the credibility of the study.

4. Figures and Examples: Figures 5 and 6 present individual examples of a growing tumor and a stable tumor, illustrating how the Bayesian prediction and its uncertainty could have altered follow-up decisions. These are useful for giving the reader an intuitive sense of the model’s utility. However, the authors should ensure that these figures are sufficiently explained. Currently, the text says “Figure 5 shows a case of a clearly growing tumor… Figure 6… a stable volume… these scenarios, the risk classifications can guide follow-up.” There is an implicit suggestion that in Figure 5 the model would flag earlier intervention, and that in Figure 6 the model would suggest less frequent imaging. It would be useful to explicitly state the model’s risk classification in each case (presumably high risk for Figure 5 and low risk for Figure 6) and whether the actual outcomes aligned with the classifications.

Suggestions for improvement: In the figure captions or main text, clearly describe each example’s data and model output. For instance: “In Fig. 5, after observing two volume measurements, the Bayesian model (with tumor-specific uncertainty) predicts a high probability of >20% growth by next follow-up (high-risk classification), which in retrospect would have been correct as the tumor indeed showed significant growth. In Fig. 6, the model predicts a stable tumor (low-risk classification), matching the observed negligible growth.” Additionally, consider adding a visual element to these figures, such as shading the predicted 95% interval over time and marking when intervention could be recommended. The goal is to ensure readers can fully understand what Figures 5 and 6 demonstrate without ambiguity.

E) Discussion

1. Interpretation of Calibration vs. Sharpness: The discussion notes that the historical method yielded very wide intervals not useful in clinical practice, whereas the Bayesian methods produced narrower intervals but were less favorable with regard to calibration. The explanation given is that the historical method prioritizes calibration by setting its estimation for the confidence intervals to match the distribution, but relevant specifics of the problem... are not taken into account, whereas the Bayesian method uses problem-specific details to individualize intervals. This explanation is on the right track, but could be phrased more clearly. As written, the sentence is a bit convoluted (perhaps due to line breaks) and readers might not follow the reasoning.

Suggestions for improvement: Refine the discussion of the calibration-sharpness trade-off in plainer terms. For example: “The historical method essentially ‘calibrates by construction’: it chooses interval bounds so that ~95% of past outcomes fell inside, but as a result it ignores case-specific information (such as how much data we have or how the tumor is growing). This leads to very broad intervals in all cases. In contrast, the Bayesian method tailors the interval to each tumor’s data, which makes the intervals narrower (more informative) but not as well calibrated; in our results, only approximately 78% of true values fell in the nominal 95% interval.” This kind of explanation explicitly ties the methods to their outcomes. Additionally, acknowledging that perfect calibration and high sharpness are competing goals (and that one can formally adjust the Bayesian intervals if needed) would show a nuanced understanding. Clearing up the language and reasoning here will help readers appreciate why each method behaved as it did.

2. Clinical Significance of Risk Classification: A major selling point of the Bayesian approach (especially with tumor-specific uncertainty) is its superior discrimination between growing and non-growing tumors (86% correct classification compared to a much lower correct classification rate for the historical method). The discussion notes this as arguably the most practical value and likely very helpful in a clinical context. However, the discussion could better connect these results to clinical decision-making. For example, what are the implications of an 86% correct classification rate? Does this mean we can safely observe a tumor if classified low-risk (with only about 15% chance of growth in that group), and conversely intervene early for high-risk ones (which had around 88% chance of growth)? The authors should also caution that even 15% growth in the low-risk category is not negligible.

Suggestions for improvement: Expand on the clinical interpretation of the risk stratification results. For instance: “Our Bayesian method could identify a subset of tumors with a very high probability of significant growth (more than 90% risk), which might warrant earlier intervention, and another subset with a very low growth risk (less than 5%), where extended surveillance intervals could be justified. However, we note that even in the low-risk group, about 15% of tumors did grow more than 20%, indicating that low-risk does not mean no-risk. Clinical decisions should therefore consider the predicted risk category alongside other factors.” Including a statement about the need for prospective validation here is also recommended, since it signals that, before this model is adopted in practice, its risk predictions should be validated on an independent patient cohort. Tying the numerical results back to potential changes in follow-up frequency or treatment triggers makes the discussion more meaningful for clinicians reading an ORL/ENT journal.

3. Auto-Segmentation Uncertainty (Limitations and Future Work): The discussion acknowledges that the volume uncertainty estimates from the segmentation were not directly validated and that they only showed a slight improvement in risk classification. It also correctly notes other sources of measurement variation (e.g., MRI geometric distortions) that were not included. These points could be elaborated. For example, how might one further develop and evaluate the auto-segmentation uncertainty? One approach is to compare the model’s volume uncertainty against ground-truth variability, for instance, test-retest scans or expert manual segmentations, to observe whether the predicted distribution covers the true volume. Additionally, the authors could also discuss whether incorporating those other sources of variability, such as scanner-related uncertainties, might significantly change the results.

Suggestions for improvement: In the future work section, propose concrete steps to refine the handling of segmentation uncertainty. For instance: “Future research should validate the auto-segmentation uncertainty estimates by comparing against known variability, such as repeat scans or inter-observer variation in manual segmentation. If the current normal-distribution assumption underestimates or mischaracterizes true volume uncertainty, more sophisticated models (e.g., quantile regression on segmentation ensembles or calibration of the uncertainty estimates) could be employed. Additionally, incorporating imaging-related uncertainties, for instance, MRI geometric distortion and slice thickness effects, is an obvious challenge since these could further widen predictive intervals if accounted for. Exploring these factors would make our uncertainty estimates more comprehensive.” By providing these suggestions, the authors demonstrate awareness of the limitations in their methods and chart a path forward for future research, a critical component of a strong Discussion section.
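The coverage comparison suggested above can be sketched in a few lines. This is only an illustrative sketch, not the authors' pipeline: it assumes each auto-segmented volume comes with a mean and a standard deviation under the normal-distribution assumption mentioned above, and that a reference volume (for example from a repeat scan or an expert manual segmentation) is available for each tumor. The function names `interval_coverage` and `coverage_table` are hypothetical.

```python
from statistics import NormalDist


def interval_coverage(pred_means, pred_sds, reference_volumes, level=0.95):
    """Fraction of reference volumes that fall inside the central `level`
    interval of the predicted normal distribution N(mean, sd)."""
    if not (len(pred_means) == len(pred_sds) == len(reference_volumes)):
        raise ValueError("all inputs must have the same length")
    if not reference_volumes:
        raise ValueError("at least one tumor is required")
    if not 0.0 < level < 1.0:
        raise ValueError("level must be strictly between 0 and 1")
    # Two-sided z-multiplier for the requested nominal level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    hits = 0
    for mean, sd, ref in zip(pred_means, pred_sds, reference_volumes):
        if mean - z * sd <= ref <= mean + z * sd:
            hits += 1
    return hits / len(reference_volumes)


def coverage_table(pred_means, pred_sds, reference_volumes,
                   levels=(0.5, 0.8, 0.95)):
    """Observed coverage at several nominal levels (a simple calibration check)."""
    return {
        level: interval_coverage(pred_means, pred_sds, reference_volumes, level)
        for level in levels
    }
```

If the observed coverage is clearly below the nominal level, the segmentation uncertainty is probably too narrow; if it is clearly above, the intervals are wider than they need to be.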

4. Handling of Atypical Growth Patterns: The authors rightly note that certain rare tumor behaviors were not captured by any of the three models, specifically tumors that decrease in volume (around 1% of cases) and tumors that remain stable for a long period and then start growing. They justify excluding the shrinking-tumor scenario from modeling due to its rarity. This is acceptable, but the discussion around it could be slightly expanded. For example, for the 1% of tumors that shrank, did the authors assess whether these reflected measurement error or true regression in tumor size? A sentence or two about potential reasons could be insightful. Similarly, the stable-then-growth pattern might indicate a latent phase followed by progression; should this be modeled as a piecewise process or with a changepoint? The authors suggest it could be measurement error, but if it is real, it might need a different modeling approach (for instance, a two-phase growth model or a random changepoint model).

Suggestions for improvement: Acknowledge these rare patterns as limitations in model generality and propose how future studies might address them. For example: “Our models assume monotonic growth, which fails to capture the rare instances of tumor regression or delayed growth re-acceleration. While true biological regression is uncommon in paragangliomas, when it occurs, it might relate to factors that are not considered in our model (for example due to infarction of the tumor). Capturing such phenomena would likely require substantially more data or dedicated case studies. Similarly, a tumor that stays stable for years before growing might violate the model assumptions. Hence, a possible future modeling approach to handle this could be a hierarchical or changepoint model that allows for a dormant phase followed by growth. Distinguishing real delayed growth from artifacts will require further longitudinal data.” This addition shows readers that the authors have thoughtfully considered the extreme cases and how one might extend the modeling framework to handle them.
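To illustrate what a changepoint model with a dormant phase could look like, the sketch below fits a deliberately simple two-phase curve: constant log-volume up to a changepoint, then linear growth in log-volume (that is, exponential growth in volume). This is an assumption made for illustration and is not one of the three Gompertz-like models in the manuscript. The function name `fit_dormant_then_growth` is hypothetical, the changepoint is restricted to observed scan times, and times are assumed to be sorted in ascending order.

```python
import math


def fit_dormant_then_growth(times, volumes):
    """Grid-search a changepoint tau over the observed scan times and fit
    log(V(t)) = baseline + rate * max(0, t - tau) by least squares.
    Returns the best fit as a dict with keys tau, baseline, rate, sse."""
    if len(times) != len(volumes) or len(times) < 3:
        raise ValueError("need at least three (time, volume) pairs")
    log_v = [math.log(v) for v in volumes]
    n = len(times)
    y_mean = sum(log_v) / n
    best = None
    # The last scan time is excluded: no data would remain in the growth phase.
    for tau in times[:-1]:
        hinge = [max(0.0, t - tau) for t in times]
        h_mean = sum(hinge) / n
        sxx = sum((h - h_mean) ** 2 for h in hinge)
        if sxx == 0.0:
            continue  # degenerate design, slope not identifiable
        sxy = sum((h - h_mean) * (y - y_mean) for h, y in zip(hinge, log_v))
        rate = sxy / sxx
        baseline = y_mean - rate * h_mean
        sse = sum((y - (baseline + rate * h)) ** 2
                  for h, y in zip(hinge, log_v))
        if best is None or sse < best["sse"]:
            best = {"tau": tau, "baseline": baseline, "rate": rate, "sse": sse}
    if best is None:
        raise ValueError("no valid changepoint candidate")
    return best
```

The fitted rate is not constrained to be positive, so a negative value would indicate shrinkage after the changepoint. Separating a real changepoint from measurement noise would still require a formal model comparison and more longitudinal data, as the comment above notes.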

5. Bayesian Methodology (Further Refinements): In the latter part of the discussion, the authors mention an interesting nuance: that their Bayesian approach attributes uncertainty mainly to uncertainty in the coefficients and that this is not the same as re-fitting growth models on resampled volumes. This essentially points out that they did not fully propagate measurement uncertainty through the model-fitting process. The authors note that while both approaches (their method and repeated refitting) may produce similar margins of uncertainty, the final shape of the growth curves... can differ and they leave investigating this to future work. This is a subtle but important point for statistical readers. It might leave some wondering whether the authors should have pursued the full refit approach and whether it would have materially changed any conclusions.

Suggestions for improvement: Strengthen this part by conveying why the chosen approach was reasonable and how much difference might arise from the alternative. For example: “Our Bayesian prediction intervals were generated by sampling parameter uncertainties conditional on the initial measured volumes. An alternative is a two-stage sampling: perturb the volume measurements according to their uncertainty, refit the model, and then predict. We expect both approaches to yield similar uncertainty widths (and indeed our experiment hints at that), but they could lead to different central predictions for individual patients. In practice, the difference manifested in slightly altered curve shapes for a few cases, although the overall risk classifications might not change. A full investigation of this refit-resample approach is warranted as future work to ensure no clinically significant discrepancies.” By adding this perspective, the authors show they are aware of more rigorous Bayesian techniques (e.g., hierarchical models that integrate measurement error) but chose a simpler approach for practicality. It also signals to informed readers that the authors recognize the importance of proper uncertainty propagation and will address it moving forward.
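The two-stage alternative described above (perturb the measurements, refit, predict) can be sketched as follows. To keep the example self-contained, a log-linear (exponential) growth curve stands in for the manuscript's Gompertz-like models; that substitution, the normal measurement-noise assumption, and the names `fit_log_linear` and `refit_resample_interval` are all assumptions made for illustration only.

```python
import math
import random


def fit_log_linear(times, volumes):
    """Least-squares fit of log(volume) = a + b * time; returns (a, b)."""
    logs = [math.log(v) for v in volumes]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    b = sxy / sxx
    return y_mean - b * t_mean, b


def refit_resample_interval(times, volumes, volume_sds, t_future,
                            n_draws=2000, level=0.95, seed=0):
    """Perturb each measured volume with normal noise, refit the growth curve,
    and predict at t_future. Returns (lower, median, upper) of the predictions."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    preds = []
    for _ in range(n_draws):
        # Clip at a small positive value so the logarithm stays defined.
        perturbed = [max(rng.gauss(v, sd), 1e-6)
                     for v, sd in zip(volumes, volume_sds)]
        a, b = fit_log_linear(times, perturbed)
        preds.append(math.exp(a + b * t_future))
    preds.sort()
    lo_idx = int((1.0 - level) / 2.0 * (n_draws - 1))
    hi_idx = int((1.0 + level) / 2.0 * (n_draws - 1))
    return preds[lo_idx], preds[n_draws // 2], preds[hi_idx]
```

Comparing the interval width and the median from such a procedure with those from parameter-only sampling, tumor by tumor, would show directly whether the two approaches disagree in ways that change a risk classification.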

F) Conclusion

1. Scope of Conclusions: The conclusion currently states that the Bayesian method performed best in distinguishing or classifying growing and non-growing tumors in the context considered, and therefore is the preferred method. This is a fair summary of the findings. However, it might be slightly too broad-sounding without a reminder of the specific context. The phrase “in the context considered” is too ambiguous without taking into consideration the actual context such as the retrospective single-center data and specific models used. It would be prudent for the authors to explicitly note any constraints on generalizability of the study findings. For instance, how might this result differ in other tumor types or if different growth models were used? The authors don’t need to speculate too far, but a cautionary note would balance the enthusiastic recommendation.

Suggestions for improvement: Tone down the conclusion to ensure it is not too broad and over-generalized. For example: “In our retrospective analysis of head and neck paragangliomas, the Bayesian uncertainty approach (with tumor-specific segmentation error) provided the most useful predictions for growth, outperforming a historical error-based method in terms of identifying likely growth vs stability. This suggests that a Bayesian framework is preferable for this problem.” Adding a sentence like “Future prospective validation will be important to confirm these findings before clinical implementation” would also be wise. This makes clear that while the findings are useful and solid, they remain context-specific and preliminary relative to actual clinical use. Including such a caveat aligns with the norms of academic writing, where researchers avoid definitive statements without external validation.

2. Recommendations for Practice or Next Steps: The manuscript’s conclusion could be enhanced by a forward-looking statement. Currently, it focuses only on summarizing the preferred method. Given that this manuscript is potentially attractive to an ORL/ENT audience, the authors might take the opportunity to suggest how these findings could impact patient management once validated. For example, if the Bayesian method were integrated into a clinical workflow, how might follow-up protocols change? Also, since the study is a methods comparison, the conclusion might outline the next research step, for instance implementing this in a prospective trial or integrating it into a decision-support tool.

Suggestions for improvement: Conclude with a brief insight into implications and future work. For instance: “These results lay the groundwork for more personalized surveillance strategies in paraganglioma patients. A logical next step will be to test the Bayesian prediction model on a prospective patient cohort or in a simulation of clinical decision-making, to assess whether using these uncertainty estimates can safely reduce imaging frequency or prompt earlier interventions when needed. Ultimately, by quantifying prediction uncertainty, clinicians can have greater confidence in differentiating indolent tumors from those that will progress.” Such a statement provides a satisfying closure by highlighting the potential real-world benefit and the path toward achieving it. It leaves the reader with an understanding that the research is moving toward improving patient care and is not just a theoretical exercise.

G) References Consulted for This Report Preparation (no article published by the reviewer was consulted or recommended to the authors for citation promotion):

1. Graham NJ, Smith JD, Else T, Basura GJ. Paragangliomas of the head and neck: a contemporary review. Endocr Oncol. 2022;2(1):R153-R162. doi:10.1530/EO-22-0080

2. Ovadia Y, Fertig E, Ren J, et al. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. Adv Neural Inf Process Syst. 2019;32:13991-14002

3. Kuleshov V, Fenner N, Ermon S. Accurate uncertainties for deep learning using calibrated regression. In: Proceedings of the 35th International Conference on Machine Learning (ICML); 2018. p. 2796-2804

4. Eaton-Rosen Z, Bragman F, Bisdas S, Ourselin S, Cardoso MJ. Towards safe deep learning: accurately quantifying biomarker uncertainty in neural network predictions. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI); 2018. p. 691-699

5. Jungo A, Balsiger F, Reyes M. Analyzing the quality and challenges of uncertainty estimations for brain tumor segmentation. Front Neurosci. 2020;14:282. doi:10.3389/fnins.2020.00282

6. Schweighofer K, Arnaiz-Rodriguez A, Hochreiter S, Oliver N. The disparate benefits of deep ensembles. arXiv [Preprint]. 2024 Oct 17. Available from: https://doi.org/10.48550/arXiv.2410.13831

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

I am more than happy to accept the manuscript in its present form. The authors have addressed all my comments satisfactorily.
