Peer-Review Record

Uncertainty-Aware Framework for CT Radiation Dose Optimization in the Active Surveillance of Small Renal Masses: Clinical and Radiological Considerations

Diagnostics 2026, 16(6), 943; https://doi.org/10.3390/diagnostics16060943

by M. A. Elsabagh¹, Amira Samy Talaat², Dalia Elwi³, Shaimaa M. Hassan⁴,⁵, Sameer Alqassimi⁶ and Esraa Hassan¹,*

Reviewer 1: Anonymous

Reviewer 2: Yogita Dubey

Reviewer 3: Anonymous

Reviewer 4: Anonymous

Submission received: 29 December 2025 / Revised: 26 February 2026 / Accepted: 27 February 2026 / Published: 23 March 2026

(This article belongs to the Section Medical Imaging and Theranostics)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Explicitly state, at the end of the Introduction, the specific clinical and methodological gap your framework addresses. Clearly list the primary objectives and hypotheses in a concise paragraph.
Focus the Introduction on why an integrated framework (agreement analysis + prediction + uncertainty quantification) is needed beyond existing studies. Ensure all claims are supported by up-to-date and directly relevant references.
Provide clearer justification for the choice of machine learning models, especially why linear regression was included and how it was fairly compared to more complex models.
Explicitly describe data splitting strategy (training/testing), sample sizes per split, and any random seed usage. Clarify how observer variability was handled and how observer-specific measurements were aggregated.
Reduce redundancy between text, tables, and figures by summarizing key findings in the text and referring readers to tables/figures for details.
Clearly highlight clinically meaningful thresholds (e.g., what magnitude of measurement difference would change management).
More explicitly compare your findings with prior low-dose CT and AI-based studies, emphasizing what is genuinely new.
Discuss generalizability, including limitations related to dataset size, retrospective design, and scanner heterogeneity.
Provide concrete examples of how this framework could be implemented in routine active surveillance workflows. Specify which patient populations would benefit most from the proposed approach.

Author Response

Detailed Responses to the Editor’s and Reviewers' Comments

Title: “Uncertainty-Aware Framework for CT Radiation Dose Optimization in the Active Surveillance of Small Renal Masses: Clinical and Radiological Considerations"

Thank you very much for the time and effort you spent reviewing our manuscript, and for your constructive comments, which helped make our paper more impactful. We have carefully considered each comment and responded point by point. Our detailed responses follow below. All page numbers refer to the revised manuscript file with tracked changes, and the changes in the revised version are highlighted in yellow.

Reviewer 1

Thank you very much for your valuable time reviewing our paper and for your comments, which helped make it more impactful. We carefully considered your comments while revising the paper. Below, we describe how we addressed each of them.

- We appreciate the reviewer's helpful suggestion. In response, we changed the last paragraph of the Introduction to make it clear what clinical and methodological gaps this work addresses that have not yet been filled. We now clearly say that there isn't an integrated, uncertainty-aware framework that can look at agreement, predictive equivalence, and clinical reliability between low-dose and standard-dose CT measurements in active surveillance of small renal masses all at once. We have also added a short, well-organized paragraph that clearly lists the main goals of the study and the hypotheses that the proposed framework is based on. This revision makes things clearer, makes the clinical motivation stronger, and makes the Introduction fit better with the study's methodological contributions.

- We added the following paragraph at the end of the Introduction, on pages 3 and 4 (lines 55-107): “Even though there is more and more evidence that low-dose CT can be used for active surveillance of small renal masses, there is still a big clinical and methodological gap: there is no single, uncertainty-aware framework that can measure inter-protocol agreement, predict measurement equivalence, and give confidence-aware decision support that is suitable for everyday clinical practice. Current studies generally focus on agreement analysis, predictive modeling, or uncertainty estimation separately, which reduces their applicability and clinical relevance. To fill this gap, the current study suggests a unified analytical framework that aims to assess the interchangeability of low-dose and standard-dose CT measurements while clearly accounting for uncertainty and observer variability. The main goals are: (i) to measure agreement, bias, and reliability between low-dose and standard-dose CT tumor measurements; (ii) to create and test predictive models that can accurately estimate standard-dose measurements from low-dose data; and (iii) to include uncertainty quantification to help doctors make decisions based on how sure they are. We hypothesize that low-dose CT measurements exhibit near-perfect concordance with standard-dose measurements, and that a streamlined, interpretable predictive model can attain superior accuracy without sacrificing reliability, thus facilitating safe, dose-optimized imaging strategies for the longitudinal monitoring of small renal masses.”

1. Explicitly state, at the end of the Introduction, the specific clinical and methodological gap your framework addresses. Clearly list the primary objectives and hypotheses in a concise paragraph.

- We appreciate the reviewer's important suggestion. We have revised the Introduction to explain why an integrated framework is necessary. We did this by explaining why (i) agreement analysis, (ii) predictive equivalence modeling, and (iii) uncertainty quantification (UQ) should be used together instead of separately. In the revised text, we (1) clarify what existing low-dose CT surveillance studies establish (primarily interchangeability/measurement agreement, often with denoising) and what they typically do not provide (a predictive mapping with calibrated confidence suitable for longitudinal decision thresholds), (2) explicitly justify why agreement alone is insufficient for deployment across heterogeneous scanners/reconstruction and multi-observer measurement variability, and (3) ensure that each claim is supported with directly relevant and up-to-date references, including recent multiobserver renal-mass surveillance work and current reviews on low-dose CT methods and uncertainty estimation in medical imaging.

- We added the following paragraph at the end of the Introduction, on page 3: “Even though low-dose CT protocols, which are often backed by iterative reconstruction or deep learning-based denoising, have shown good agreement with routine-dose CT for keeping an eye on small renal masses, the evidence base is still not very strong. Recent studies on active surveillance of renal masses have mainly focused on the reproducibility of measurements and agreement between protocols. These studies have shown that significant dose reduction can maintain tumor diameter assessment in controlled settings and with multiple observers. But agreement statistics by themselves don't give you a decision-support tool that you can use: they don't show you how to map low-dose measurements to routine-dose equivalents at the patient level, and they don't tell you how much confidence you need to interpret small changes over time that could lead to intervention thresholds. This limitation is more important in real life, where surveillance data are not all the same (different scanner models, acquisition parameters, reconstruction/denoising strategies, and reader variability), and where clinical decisions depend on whether size changes are bigger than expected measurement noise. In the same way, previous research on low-dose CT enhancement has mostly been about improving image quality and the accuracy of measurements through reconstruction or denoising. However, this does not automatically mean that quantitative measurements are clearly and reliably equivalent across protocols for long-term monitoring. Simultaneously, uncertainty quantification has advanced in medical imaging and machine learning; however, it is often regarded as a separate subject and is not consistently incorporated into dose-optimization workflows that also encompass agreement evidence and predictive translation across protocols.
So, we need a single framework that (i) uses agreement and reliability statistics to show that measurements can be swapped, (ii) learns a clear predictive translation from low-dose to routine-dose measurements to help protocols work together, and (iii) gives calibrated uncertainty estimates (like prediction intervals/coverage) so that doctors can tell if differences are likely to be clinically significant or just normal variability. This combined approach directly supports confidence-aware decision-making in active surveillance, where the safety benefits of dose reduction must be balanced against the risk of misclassifying growth trajectories.”

Focus the Introduction on why an integrated framework (agreement analysis + prediction + uncertainty quantification) is needed beyond existing studies. Ensure all claims are supported by up-to-date and directly relevant references.

- We appreciate the reviewer's suggestion that we better explain how we chose and compared the machine learning models. In response, we have revised the Methods section to explicitly clarify the rationale for including linear regression alongside more complex nonlinear models. Linear regression was deliberately included as a clinically interpretable, parsimonious model to test the hypothesis that the relationship between low-dose and standard-dose CT measurements is primarily linear, given that both measurements quantify the same physical tumor dimension under different noise conditions. All of the models were trained and tested under the same conditions, using the same feature sets, data splits, preprocessing steps, and evaluation metrics, so that the comparison is fair and unbiased. This revision makes it clear that the better performance of linear regression reflects the structure of the data rather than differences in how the models were trained or evaluated, and it supports the clinical usefulness of transparent, computationally efficient models for dose-optimized imaging applications.

- We added the following paragraph on page 9, at the beginning of Section 2.3 (Machine Learning Algorithms & Architectural Innovation): “The choice of machine learning models was based on both methodological rigor and clinical usefulness. Linear regression was intentionally incorporated not solely as a baseline, but as a hypothesis-driven model to evaluate whether the correlation between low-dose and standard-dose CT measurements is inherently linear, considering that both seek to estimate the identical physical tumor diameter under diverse noise conditions. In clinical measurement translation tasks, parsimonious linear models are often preferred when they attain similar or greater accuracy, owing to their interpretability, robustness, and ease of implementation. To make sure the comparison was fair and not biased, all models (linear regression, random forest, gradient boosting, and support vector regression) were trained on the same observer-aware feature set, standardized preprocessing pipeline, identical train-test splits, and consistent performance metrics (R², MAE, RMSE, and MAPE). We chose hyperparameters based on best practices for each model class, and we used multi-split robustness analysis to double-check the model's performance. This standardized evaluation framework guarantees that any performance variations observed are indicative of authentic model–data compatibility, rather than inconsistencies in training or evaluation conditions.”

- We added the following statement on page 21, at the end of the Discussion (Section 4.4, Clinical and Radiological Considerations): “The finding that linear regression outperformed more complex nonlinear models indicates that the low-dose to standard-dose measurement relationship is primarily linear in this context, thereby supporting the use of transparent models for clinically reliable dose-optimized imaging.”
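As an illustration of the shared evaluation described above, the four metrics named in the response (R², MAE, RMSE, MAPE) can be computed identically for every model on the same test split. The following is a minimal sketch using only the Python standard library; the function name and the example diameters are hypothetical and are not taken from the manuscript.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, RMSE and MAPE (%) for paired true/predicted values.

    Assumes non-zero, non-constant true values (diameters in mm).
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Hypothetical standard-dose diameters (mm) and one model's predictions
standard_mm = [20.0, 25.0, 30.0, 35.0]
predicted_mm = [21.0, 24.0, 31.0, 34.0]
metrics = regression_metrics(standard_mm, predicted_mm)
```

Applying the same function to each model's predictions on the same held-out cases keeps the comparison limited to the models themselves.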

Provide clearer justification for the choice of machine learning models, especially why linear regression was included and how it was fairly compared to more complex models.

- We appreciate the reviewer's emphasis on the need for methodological transparency and reproducibility. In response, we have changed the Methodology section to make it clear how we split the data, including the training and testing proportions, the sample sizes for each split, and the use of a random seed. We also explain how observer-specific measurements were added through observer-aware feature engineering and how these measurements were combined to consider differences between observers. The new text makes it clear that observer variability was modeled directly instead of being averaged out. This makes sure that both agreement analysis and predictive modeling accurately reflect real-world clinical settings with multiple observers.

- We added the following paragraph on page 9, in Section 2.2 (Overall Framework Design): “The dataset consisted of 40 paired cases, each containing several observer-specific tumor diameter measurements derived from both low-dose and standard-dose CT acquisitions. For predictive modeling, the data were split into training and testing sets with a fixed 75/25 ratio. This meant that 30 cases were used to train the model and 10 cases were kept for independent testing. A fixed random seed was used during data splitting and model initialization to make sure that the results could be reproduced. We tested the model's robustness by repeating random train-test splits, each time keeping the same proportion, and reported the performance variability across the splits.

Observer variability was explicitly addressed through observer-aware feature engineering instead of relying solely on simple averaging. For each case and dose level, individual observer measurements were initially identified automatically, followed by the calculation of descriptive statistics (namely, the mean, standard deviation, and coefficient of variation) across observers. These aggregated features capture both central tendency and inter-observer variability, and they were used as inputs for the model. We used the full set of observer-specific measurements to do agreement and reliability analyses (CCC, ICC, Bland–Altman) to keep observer-level information. For predictive modeling, we used the aggregated observer-aware features to make case-level predictions that were stable. This method makes sure that observer variability is clearly modeled and included in both statistical and machine learning analyses, which is how clinical measurements are done in real life.”
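The split and the observer-aware features described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the seed value (42), the function names, and the three example readings are hypothetical, and the sample standard deviation is assumed because the response does not say which estimator was used.

```python
import random
import statistics


def split_cases(case_ids, test_fraction=0.25, seed=42):
    """Shuffle case IDs with a fixed seed and split them into train/test lists."""
    rng = random.Random(seed)        # fixed seed makes the split reproducible
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def observer_features(measurements_mm):
    """Summarize one case's observer readings: mean, SD, coefficient of variation."""
    mean = statistics.mean(measurements_mm)
    sd = statistics.stdev(measurements_mm)   # sample SD; needs >= 2 observers
    return {"mean": mean, "sd": sd, "cv": sd / mean}


train_ids, test_ids = split_cases(range(40))          # 30 training, 10 test cases
features = observer_features([24.0, 25.0, 26.0])      # hypothetical readings (mm)
```

Splitting at the case level, rather than at the level of individual observer readings, keeps all readings of one tumor on the same side of the split.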

Explicitly describe data splitting strategy (training/testing), sample sizes per split, and any random seed usage. Clarify how observer variability was handled and how observer-specific measurements were aggregated.

- We appreciate the reviewer's helpful suggestion. In response, we changed the Results and Discussion sections to make the narrative text less like the tables and figures that go with it. The main text has been streamlined to focus on summarizing the key findings and their clinical relevance. It now directs readers to the appropriate tables and figures for more detailed quantitative information. This revision makes it easier to read, cuts down on repetition, and makes it clearer how to interpret the results without leaving out any important details.

- We add the following paragraph at page 12 at the Results Section: “The Results section focuses on the most important findings and clinical interpretation to make things clearer and avoid repeating information. The tables and figures show the full numerical results.”

- We replaced “The CCC was 0.9930, indicating almost perfect agreement. The ICC for low-dose observers was 0.9642, and the ICC for normal-dose observers was 0.9654. Pearson correlation was 0.9933 and Spearman correlation was 0.9827. The mean absolute difference was 0.6760 mm and the root mean square difference was 0.9474 mm, as shown in Table 4.” with “Advanced agreement analysis showed almost perfect agreement between low-dose and normal-dose measurements, with very little systematic error and very high reliability across observers. Table 4 shows a summary of the detailed agreement metrics, such as CCC, ICC, correlation coefficients, and error statistics.” in Section 3.2 (Inter-Protocol Agreement Analysis) on page 13.
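For readers unfamiliar with the CCC reported above, the sketch below computes Lin's concordance correlation coefficient from paired measurements. It uses the standard definition with 1/n variances and covariance; the function name and example values are hypothetical, and the response does not state which software the authors used.

```python
def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # The squared mean difference in the denominator penalizes systematic bias
    return 2.0 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical example: a constant 1 mm offset lowers CCC although Pearson r = 1
ccc_offset = concordance_ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
ccc_identical = concordance_ccc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Unlike the Pearson correlation, the CCC falls below 1 when one protocol reads systematically higher than the other, which is why it is the more informative statistic for interchangeability.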

Reduce redundancy between text, tables, and figures by summarizing key findings in the text and referring readers to tables/figures for details.

- We thank the reviewer for this important clinical perspective. In response, we have changed the manuscript to make it clear how the differences in measurements relate to clinically important thresholds used in the active surveillance of small renal masses. We now make it clear that management decisions are usually based on tumor growth that stays above several millimeters over a series of tests, not on changes that are less than a millimeter. We also make it clear how the differences in measurements and limits of agreement relate to these thresholds. This addition makes it easier for doctors to understand the results and explains why the differences between low-dose and standard-dose CT measurements are not likely to change how patients are treated.

- We added the following paragraph on page 21, in Discussion Section 4.4 (Clinical and Radiological Considerations): “From a clinical standpoint, management decisions during active surveillance of small renal masses are influenced by persistent and reproducible alterations in tumor size rather than insignificant single-measurement fluctuations. Current urological practice and surveillance protocols generally consider tumor growth on the order of several millimeters over time (commonly ≥3–5 mm, or a consistent growth rate across serial examinations) as potentially actionable, prompting closer follow-up or intervention, rather than isolated sub-millimeter differences. In this context, the mean bias of −0.094 mm between low-dose and standard-dose CT measurements and the narrow 95% limits of agreement of about ±2 mm are both well below the levels that would be expected to influence clinical decision-making. Consequently, the minor measurement discrepancies noted between dose protocols are unlikely to lead to misclassification of tumor growth trajectories or inappropriate changes in patient management.”

- We added the following statement on page 13, in Section 3.2 (Inter-Protocol Agreement Analysis): “The observed measurement differences between low-dose and normal-dose CT were substantially lower than the clinically relevant growth thresholds used in active surveillance protocols, supporting the clinical interchangeability of the two dose protocols.”
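The comparison between limits of agreement and a growth threshold can be made concrete with a short Bland–Altman sketch. The 3 mm default reflects the lower end of the ≥3–5 mm range quoted above; the sign convention (low-dose minus standard-dose), the function names, and the example values are assumptions for illustration only.

```python
import statistics


def bland_altman(low_dose_mm, standard_dose_mm):
    """Return (bias, lower LoA, upper LoA) for paired measurements.

    Differences are taken as low-dose minus standard-dose (an assumption).
    """
    diffs = [l - s for l, s in zip(low_dose_mm, standard_dose_mm)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)              # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def loa_within_threshold(lower, upper, threshold_mm=3.0):
    """True if both limits of agreement lie inside +/- threshold_mm."""
    return max(abs(lower), abs(upper)) < threshold_mm


# Hypothetical paired diameters (mm)
low = [20.5, 24.5, 30.5, 34.5]
std = [20.0, 25.0, 30.0, 35.0]
bias, loa_low, loa_high = bland_altman(low, std)
acceptable = loa_within_threshold(loa_low, loa_high)
```

If the limits of agreement fall inside the chosen threshold, a difference between protocols alone would not be expected to cross an actionable growth cut-off.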

Clearly highlight clinically meaningful thresholds (e.g., what magnitude of measurement difference would change management).

- We appreciate the reviewer's helpful suggestion. We have made the Discussion (and, where appropriate, the end of the Introduction) stronger so that our results are more clearly compared to previous low-dose CT and AI-based work. Specifically, we now: (i) differentiate between previous studies that primarily establish interchangeability through agreement/reproducibility and those that concentrate on image denoising/reconstruction, in contrast to our work, which amalgamates agreement analysis, predictive equivalence modeling, and uncertainty quantification into a cohesive, decision-support-oriented pipeline; (ii) directly compare reported agreement findings (e.g., multi-observer SRM surveillance under dose reduction, with/without deep-learning denoising) to our nearly perfect concordance; and (iii) emphasize that our principal innovation lies not only in enhanced accuracy but in the integrated framework that produces interpretable predictions with calibrated uncertainty and robustness assessment suitable for clinical surveillance workflows. We reference recent renal mass surveillance studies and contemporary uncertainty-estimation reviews to substantiate these assertions.

- We added the following paragraph on pages 19 and 20: “Previous research on low-dose CT for monitoring small renal masses has primarily concentrated on establishing measurement interchangeability between reduced-dose and standard-dose acquisitions in a multi-observer context. This includes findings that significant dose reduction can maintain size-based evaluation and that deep learning–based denoising may enhance low-dose monitoring without compromising clinical interpretability. These studies provide essential clinical reassurance regarding agreement and feasibility; however, they typically stop short of delivering an integrated, deployable framework that (i) quantifies agreement and observer reliability, (ii) learns an explicit predictive mapping from low-dose to routine-dose–equivalent measurements, and (iii) reports calibrated uncertainty to support confidence-aware interpretation of longitudinal changes. In contrast, our contribution is a single, uncertainty-aware pipeline that combines agreement statistics (CCC/ICC/Bland–Altman), multi-model prediction, and uncertainty quantification/robustness testing. This makes it possible not only to check for interchangeability, but also to translate the results in a way that is clinically meaningful and includes clear confidence estimates. This integration is especially important for active surveillance, where management depends on finding sustained growth that goes beyond the expected range of measurement variability. This is because scanner heterogeneity and multi-reader variation can affect longitudinal consistency.”

More explicitly compare your findings with prior low-dose CT and AI-based studies, emphasizing what is genuinely new.

- We appreciate the reviewer's emphasis on the importance of generalizability. In response, we have expanded the Discussion section to address explicitly the limitations related to dataset size, the retrospective design, and scanner heterogeneity. We explain how these factors may affect external validity and describe how the use of a multi-scanner, multi-observer dataset and robustness analyses partially mitigates these issues. This discussion gives a balanced assessment of how well the framework can be applied in different settings and suggests directions for future multi-institutional validation.

- We added the following paragraph on pages 21 and 22, in Section 4.5 (Limitations and Future Work): “Even though the results are promising, this study has several limitations that should be noted. First, the study sample size was relatively small (n = 40 cases) because there are few publicly available datasets with paired normal-dose and low-dose CT acquisitions for measuring renal tumors by multiple observers. Although this sample size aligns with previous multi-observer imaging studies and adequately demonstrates robust agreement and predictive performance, larger cohorts are required to evaluate scalability, subgroup performance, and infrequent tumor presentations. Second, the analysis depended on simulated low-dose CT images obtained from routine-dose acquisitions instead of prospectively acquired low-dose CT scans. Simulation-based dose reduction is a widely accepted method for controlled methodological evaluation that enables direct comparison under identical anatomical conditions. However, it may not comprehensively account for all sources of variability present in real-world prospective low-dose imaging, including protocol-dependent noise characteristics and reconstruction discrepancies. To confirm clinical performance, it will be necessary to use true low-dose CT acquisitions for prospective validation. Third, validation utilized a single publicly accessible dataset (KiTS19), which, despite being multi-institutional and multi-scanner, may not encompass the complete diversity of CT acquisition protocols, reconstruction techniques, and vendor-specific attributes found in standard clinical practice. Consequently, the generalizability to alternative datasets, imaging protocols, or novel reconstruction technologies cannot be completely assured without further external validation. Finally, the linear regression model’s superiority may be because the current dataset showed a strong linear relationship between low-dose and normal-dose measurements. In datasets characterized by elevated noise levels, diverse reconstruction methodologies, or intricate nonlinear measurement distortions, more sophisticated nonlinear models may deliver enhanced performance. Future research will investigate model adaptability across various datasets and evaluate whether the most effective modeling strategy should be customized for particular imaging conditions.”

Discuss generalizability, including limitations related to dataset size, retrospective design, and scanner heterogeneity.

- We appreciate the reviewer's clinically focused recommendation. In response, we have added more examples to the Discussion section to show how the proposed framework can be used in everyday active surveillance workflows, such as dose-optimized follow-up imaging and confidence-aware longitudinal tumor assessment. We additionally specify patient populations that are most likely to benefit from this approach, particularly those requiring repeated imaging over extended surveillance periods. These additions make it clearer how the framework can be used in real-life clinical decision-making.

- We added the following paragraph on page 20: “The proposed framework can be used as a post-acquisition decision-support layer in regular active surveillance workflows. It works with standard radiological measurements and doesn't change how imaging or reporting is done. After a low-dose CT scan, observer-specific tumor diameter measurements can be entered into the framework. The framework then checks for agreement metrics, makes a predicted routine-dose–equivalent measurement, and gives a prediction interval that shows how uncertain the measurement is. This output helps radiologists and urologists tell the difference between normal changes in measurements and tumor growth that is clinically significant over time. For instance, if a low-dose follow-up scan shows a small increase in size, the framework can tell you if this change is within the expected range of uncertainty or if it is outside of the limits that require closer monitoring or action.”
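The follow-up example in the quoted paragraph can be sketched as a small decision rule. Everything numeric here is a placeholder: the linear mapping (slope 1, intercept 0), the ±2 mm interval half-width, and the 3 mm growth threshold are assumptions chosen for illustration, not fitted values or thresholds reported by the authors, and the function name is hypothetical.

```python
def assess_follow_up(previous_mm, low_dose_mm, slope=1.0, intercept=0.0,
                     half_width_mm=2.0, growth_threshold_mm=3.0):
    """Classify a follow-up measurement against a growth threshold.

    All default parameters are hypothetical placeholders.
    """
    # Map the low-dose reading to a routine-dose-equivalent estimate
    predicted = slope * low_dose_mm + intercept
    lower, upper = predicted - half_width_mm, predicted + half_width_mm
    if lower - previous_mm >= growth_threshold_mm:
        status = "exceeds threshold"        # whole interval shows actionable growth
    elif upper - previous_mm < growth_threshold_mm:
        status = "below threshold"          # whole interval is under the threshold
    else:
        status = "indeterminate"            # interval straddles the threshold
    return {"predicted_mm": predicted, "interval_mm": (lower, upper),
            "growth_mm": predicted - previous_mm, "status": status}


stable = assess_follow_up(previous_mm=25.0, low_dose_mm=25.5)
unclear = assess_follow_up(previous_mm=25.0, low_dose_mm=27.0)
growing = assess_follow_up(previous_mm=25.0, low_dose_mm=31.0)
```

The three-way outcome is a design choice of this sketch: an interval that straddles the threshold is reported as indeterminate rather than forced into a yes/no answer, which matches the idea of confidence-aware interpretation.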

Provide concrete examples of how this framework could be implemented in routine active surveillance workflows. Specify which patient populations would benefit most from the proposed approach.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors presented a framework for dose optimization in CT-based surveillance, with strong statistical validation and clinical relevance.

However, the figures appear to be AI-generated, which raises concerns regarding transparency, reproducibility, and scientific authenticity.

Please mention which figures were generated using AI tools, justify their use, and confirm that all visualizations accurately represent real experimental data and analyses.

Author Response

Reviewer 2

- We appreciate the reviewer's important points about openness and scientific honesty. We want to make it clear that AI tools were not used to make any figures that show experimental results, statistical analyses, or quantitative findings. All data-driven figures, such as agreement analyses, Bland–Altman plots, model performance plots, calibration curves, and robustness analyses, were made directly from the experimental data using standard scientific plotting libraries. They accurately show the analyses that were reported.

The framework overview and workflow diagrams in the Methods section are just conceptual examples meant to give a quick overview of the analytical pipeline. The authors manually redrew these schematic figures using standard graphics software to make them clearer and more consistent. They do not contain any fake data or AI-generated content. No AI tools were used to make, change, or improve experimental results or visualizations.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This publication describes an excellent, well-designed, and clinically applicable study that combines statistical agreement analysis, machine learning prediction, and uncertainty quantification to validate low-dose CT as a means of monitoring renal masses. The demonstration that a simple linear regression outperforms more complex models is an important and impressive addition to the field of interpretable artificial intelligence (AI) in radiology.

Although uncertainty quantification (UQ) is presented as a primary focus of the manuscript, there is no clear discussion of how uncertainty was quantified. Please add a separate subsection (e.g., 3.3) describing:

a) UQ methodologies used.

b) How uncertainty intervals were generated and validated.

c) How UQ output will be used in clinical decision-making.

The manuscript is missing a section on the limitations of the study. Please include a subsection titled "Limitations" before the "Conclusion," addressing:

a) Study sample size (n=40).

b) The use of simulated low-dose CT images rather than prospectively acquired low-dose CT scans.

c) Validation based on only one dataset (KiTS19) and potential lack of generalizability using other protocols/scanners.

d) The possibility that superiority of the linear model might only exist for this dataset.

Table 5 indicates that N=10 was used for testing but does not clarify whether this refers to 10 patients or to 10 images created from 10 patients. Please justify the number of patients used for testing and clarify how the images were divided for reporting purposes. If possible, it would also be beneficial to provide cross-validation results to strengthen the study's conclusions.

Terms such as "establishes new standards" could be considered overreaching; therefore use more tempered language (e.g., "demonstrates strong evidence for validation," "provides evidence to help with the uptake into clinical practice"). Use reasonable timelines and factual findings to support your conclusions/arguments.

Some figures and tables are referred to incorrectly (e.g., "Table 6" is cited but does not exist in this excerpt).
Be sure to number and refer to each figure/table in numerical order within the body of your text.

Although Figure 10 depicts how a workflow of this type would appear, you have not adequately described potential real-world constraints (e.g., PACS integration, acceptance by radiologists, legal factors). Please add a short paragraph to the Discussion section on the difficulties that may arise and the reasons such workflows are slow to reach clinical use, along with suggested solutions.

Please define all acronyms at first use (e.g., DLIR, MBIR, VI in UQ section).

Comments on the Quality of English Language

Some sentences are awkwardly phrased (e.g., “uncertainty-conscience framework” should likely be “uncertainty-aware framework”). Consider a final language polish.

Author Response

Detailed Responses to the Editor’s and Reviewers' Comments

Title: “Uncertainty-Aware Framework for CT Radiation Dose Optimization in the Active Surveillance of Small Renal Masses: Clinical and Radiological Considerations”

Reviewer 3

We appreciate the reviewer's helpful comment. In response, we have added a new Section 4.3 (Uncertainty and Clinical Decision-Making) on pages 19 and 20 to explain the uncertainty quantification method used in this study. This new subsection describes (i) the exact UQ methods that were used, (ii) how prediction intervals were generated and empirically validated, and (iii) how uncertainty estimates are meant to support clinical decision-making during active surveillance. This addition clarifies both the methodology and the role that uncertainty quantification plays in the proposed framework.

“Previous research on low-dose CT for monitoring small renal masses has primarily concentrated on establishing measurement interchangeability between reduced-dose and standard-dose acquisitions in a multi-observer context. This includes findings that significant dose reduction can maintain size-based evaluation and that deep learning–based denoising may enhance low-dose monitoring without compromising clinical interpretability. These studies provide essential clinical reassurance regarding agreement and feasibility; however, they typically stop short of delivering an integrated, deployable framework that (i) quantifies agreement and observer reliability, (ii) learns an explicit predictive mapping from low-dose to routine-dose–equivalent measurements, and (iii) reports calibrated uncertainty to support confidence-aware interpretation of longitudinal changes. In contrast, our contribution is a single, uncertainty-aware pipeline that combines agreement statistics (CCC/ICC/Bland–Altman), multimodal prediction, and uncertainty quantification/robustness testing. This makes it possible not only to check for interchangeability but also to translate the results in a way that is clinically meaningful and includes clear confidence estimates. This integration is especially important for active surveillance, where management depends on finding sustained growth that goes beyond the expected range of measurement variability and where scanner heterogeneity and multi-reader variation can affect longitudinal consistency.”
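
For readers unfamiliar with the agreement statistics named in the quoted paragraph, the following is a minimal sketch of Lin's concordance correlation coefficient (CCC) and Bland–Altman limits of agreement. It is not the authors' code: the function names `lin_ccc` and `bland_altman` are hypothetical, the diameters are illustrative values rather than study data, and the 1.96 multiplier assumes approximately normal differences.

```python
from statistics import mean, pvariance, stdev


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    mx, my = mean(x), mean(y)
    # Population (1/n) covariance, matching the population variances below.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return 2 * cov / (pvariance(x) + pvariance(y) + (mx - my) ** 2)


def bland_altman(x, y):
    """Return (bias, lower limit, upper limit) for the differences x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical tumor diameters in mm (illustrative values, not study data).
low_dose = [21.0, 33.5, 18.2, 40.1, 27.3]
routine_dose = [21.4, 33.0, 18.5, 39.6, 27.9]

ccc = lin_ccc(low_dose, routine_dose)
bias, lower, upper = bland_altman(low_dose, routine_dose)
print(f"CCC = {ccc:.4f}; bias = {bias:.2f} mm; LoA = [{lower:.2f}, {upper:.2f}] mm")
```

Unlike Pearson correlation, the CCC penalizes a systematic offset between the two acquisitions, which is why it is commonly paired with Bland–Altman limits when interchangeability is the question.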

1. Although uncertainty quantification (UQ) is presented as a primary focus of the manuscript, there is no clear discussion of how uncertainty was quantified. Please add a separate subsection (e.g., 3.3) describing:

a) UQ methodologies used.

b) How uncertainty intervals were generated and validated.

c) How UQ output will be used in clinical decision-making.

- We thank the reviewer for this important suggestion. In response, we have added a dedicated subsection titled “Limitations” immediately prior to the Conclusions. This subsection explicitly addresses the modest sample size, the retrospective use of simulated low-dose CT images, reliance on a single public dataset (KiTS19), and the possibility that the observed superiority of the linear regression model may be dataset-specific. These additions provide a balanced assessment of the study’s scope and generalizability and outline directions for future prospective and multi-institutional validation.

- We added Subsection 4.5 (Limitations and Future Work) on pages 21 and 22: “Although the results are promising, this study has several limitations that should be noted. First, the study sample size was relatively small (n = 40 cases) because few publicly available datasets provide paired normal-dose and low-dose CT acquisitions for multi-observer renal tumor measurement. Although this sample size aligns with previous multi-observer imaging studies and adequately demonstrates robust agreement and predictive performance, larger cohorts are required to evaluate scalability, subgroup performance, and infrequent tumor presentations.

Second, the analysis depended on simulated low-dose CT images derived from routine-dose acquisitions rather than on prospectively acquired low-dose CT scans. Simulation-based dose reduction is a widely accepted method for controlled methodological evaluation that enables direct comparison under identical anatomical conditions. However, it may not comprehensively account for all sources of variability present in real-world prospective low-dose imaging, including protocol-dependent noise characteristics and reconstruction discrepancies. To confirm clinical performance, prospective validation with true low-dose CT acquisitions will be necessary.

Third, validation utilized a single publicly accessible dataset (KiTS19), which, despite being multi-institutional and multi-scanner, may not encompass the complete diversity of CT acquisition protocols, reconstruction techniques, and vendor-specific attributes found in standard clinical practice. Consequently, the generalizability to alternative datasets, imaging protocols, or novel reconstruction technologies cannot be completely assured without further external validation.

Finally, the linear regression model's superiority may arise because the current dataset showed a strong linear relationship between low-dose and normal-dose measurements. In datasets characterized by elevated noise levels, diverse reconstruction methodologies, or intricate nonlinear measurement distortions, more sophisticated nonlinear models may deliver enhanced performance. Future research will investigate model adaptability across various datasets and evaluate whether the most effective modeling strategy should be customized for imaging conditions.”

2. The manuscript is missing a section on the limitations of the study. Please include a subsection titled “Limitations” before the “Conclusion,” addressing:

a) Study sample size (n=40).

b) The use of simulated low-dose CT images rather than prospectively acquired low-dose CT scans.

c) Validation based on only one dataset (KiTS19) and potential lack of generalizability using other protocols/scanners.

d) The possibility that superiority of the linear model might only exist for this dataset.

- We appreciate the reviewer's request for this important clarification. We revised the Methods and Results sections to make it clear that the test set comprised 10 independent patients (cases), not individual images. All observer-specific measurements for each patient were combined at the case level before the model was evaluated. The data splitting strategy was designed to maintain patient-level independence and to prevent information leakage between the training and testing sets. We further explain why the test set size is appropriate given the small size of the multi-observer dataset, and we also explain how the measurements were organized for reporting. Furthermore, we now explicitly highlight the previously conducted and reported repeated random train–test split analysis, which functions as a cross-validation-style robustness evaluation and enhances the reliability of the study's conclusions.

- The following paragraph was added on page 9: “The dataset consisted of 40 paired cases, each containing multiple observer-specific tumor diameter measurements derived from both low-dose and standard-dose CT acquisitions. For prediction, the data were split into training and testing sets with a fixed 75/25 ratio, so that 30 cases were used to train the model and 10 cases were held out for independent testing. A fixed random seed was used during data splitting and model initialization to ensure reproducibility. We tested the model's robustness by repeating the random train–test split, each time with the same proportion, and reported the performance variability across the splits.”
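
As an illustration of the splitting strategy described in the quoted paragraph, a minimal sketch of a seeded, case-level 75/25 split follows. It is not the authors' implementation: the helper `split_cases`, the seed value 42, and the integer case IDs are assumptions made only for this example.

```python
import random


def split_cases(case_ids, test_fraction=0.25, seed=42):
    """Shuffle case IDs with a fixed seed and hold out a fraction for testing."""
    ids = sorted(case_ids)        # canonical order, so the seed alone fixes the split
    rng = random.Random(seed)     # local generator; global random state is untouched
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]   # (training IDs, testing IDs)


# 40 hypothetical case IDs; every measurement of a case follows its ID.
train_ids, test_ids = split_cases(range(1, 41))
print(len(train_ids), "training cases;", len(test_ids), "test cases")
```

Splitting on case IDs rather than on individual measurements is what keeps all readings of one patient on the same side of the split.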

- The following paragraph was added on page 15: “All reported predictive performance metrics are calculated at the patient level, with observer-specific measurements consolidated into aggregated features for each case. During training, no measurements or images from test patients or individual observers were used.”
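
One way such case-level consolidation could look is sketched below. The record does not state which aggregated features the authors used, so the mean, standard deviation, and range chosen here are assumptions, as are the helper name `aggregate_case` and the readings themselves.

```python
from statistics import mean, stdev


def aggregate_case(readings_mm):
    """Collapse one case's observer readings into case-level features."""
    return {
        "mean_mm": mean(readings_mm),
        "sd_mm": stdev(readings_mm),  # spread across observers
        "range_mm": max(readings_mm) - min(readings_mm),
    }


# Hypothetical low-dose readings (mm) from six observers, keyed by case ID.
readings = {
    "case_01": [20.0, 21.0, 22.0, 21.0, 20.0, 22.0],
    "case_02": [34.0, 35.0, 33.0, 34.0, 36.0, 32.0],
}
features = {case_id: aggregate_case(values) for case_id, values in readings.items()}
print(features["case_01"])
```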

- The following paragraph was added on page 16: “A repeated random train–test split analysis was performed to further assess the reliability of the results, serving as a cross-validation-style robustness check. The linear regression model consistently exhibited high predictive performance across splits with distinct test sets, with minimal variability (mean R² = 0.9833 ± 0.0119), indicating that the observed results do not depend on a single data partition.”
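
The repeated-split analysis described in the quoted paragraph can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the number of repeats (20), the seeds, and the helper names `fit_line`, `r_squared`, and `repeated_split_r2` are all hypothetical, and a single-predictor least-squares fit stands in for the linear regression model.

```python
import random
from statistics import mean, stdev


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def repeated_split_r2(x, y, n_repeats=20, test_fraction=0.25, seed=0):
    """Mean and SD of test-set R^2 over repeated random splits."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    n_test = round(len(idx) * test_fraction)
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        preds = [slope * x[i] + intercept for i in test]
        scores.append(r_squared([y[i] for i in test], preds))
    return mean(scores), stdev(scores)


# Synthetic, near-linear data standing in for 40 case-level diameters (mm).
data_rng = random.Random(1)
low = [data_rng.uniform(10, 40) for _ in range(40)]
routine = [1.02 * v + 0.3 + data_rng.gauss(0, 0.5) for v in low]

mean_r2, sd_r2 = repeated_split_r2(low, routine)
print(f"R^2 = {mean_r2:.4f} +/- {sd_r2:.4f} over 20 splits")
```

Because the test sets of different repeats overlap, the reported spread describes sensitivity to the partition rather than an independent estimate of generalization error, which is consistent with the authors calling it a cross-validation-style check rather than cross-validation.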

3. Table 5 indicates that N=10 was used for testing but does not clarify whether this refers to 10 patients or to 10 images derived from 10 patients. Please justify the number of patients used for testing and clarify how these images were divided among sections for reporting purposes. If possible, cross-validation results would also strengthen the study's conclusions.

- We appreciate the reviewer's helpful suggestion. In response, we changed the manuscript to use more measured and evidence-based language instead of language that could be seen as too broad. Statements suggesting the formulation of new standards have been moderated to accurately represent the study’s contribution, focusing on robust validation evidence and practical support for future clinical implementation rather than conclusive standard-setting. We have also made sure that the conclusions and translational implications are based on the results we presented and are set within reasonable timeframes for further validation and use.

4. Terms such as "establishes new standards" could be considered overreaching; please use more tempered language (e.g., "demonstrates strong evidence for validation," "provides evidence to help with the uptake into clinical practice"). Use reasonable timelines and factual findings to support your conclusions and arguments.

- We thank the reviewer for pointing out this problem. We have carefully gone over the entire manuscript to make sure that all of the tables and figures are numbered correctly and that the text always refers to them in numerical order. All incorrect or outdated references, including those pointing to tables or figures that do not exist, have been fixed, and the in-text citations have been updated to match the new numbering. This audit ensures that the manuscript is clear and consistent and that its cross-references are correct.

5. Some figures and tables are referenced incorrectly (e.g., "Table 6" is cited but does not exist in this excerpt).
Please number each figure and table and refer to them in numerical order within the body of the text.

- We appreciate the reviewer's useful and clinically relevant comments. In response, we have expanded the Discussion section to address specific obstacles to using these workflows in practice, such as PACS integration, clinician acceptance, legal and regulatory compliance, and concerns about workflow disruption. We also discuss possible mitigation strategies, focusing on how transparent, interpretable models, post-acquisition decision support, and prospective validation can support safe and gradual clinical adoption.

- We added the following paragraph on page 27: “Even though Figure 10 shows that these uncertainty-aware surveillance workflows could work in theory, there are several real-world problems that could make it harder for doctors to start using them. Integrating with current PACS and radiology information systems is still a practical problem because clinical settings often use different software systems and reporting pipelines that are hard to change. Also, radiologists and referring clinicians may not accept decision-support tools if they think they are unclear, get in the way of workflow, or haven't been tested enough. Legal and regulatory factors, such as liability for automated suggestions and adherence to medical device rules, also make people hesitant to adopt new technologies. To address these challenges, the proposed framework is deliberately structured as a post-acquisition, non-intrusive decision-support layer that utilizes standard radiological measurements instead of raw images, thereby reducing integration complexity. Using models that can be understood and reporting uncertainty clearly helps build trust and accountability among clinicians. Additionally, validating the models in a prospective, multi-institutional way and rolling them out in phases alongside routine practice may help them get regulatory approval and be gradually adopted into clinical workflows.”

6. Although Figure 10 depicts how a workflow of this type would appear, you have not adequately described potential real-world constraints (e.g., PACS integration, acceptance by radiologists, legal factors). Please add a short paragraph to the Discussion section on the difficulties that may arise and the reasons such workflows are slow to reach clinical use, along with suggested solutions.

- We appreciate the reviewer's important editorial suggestion. We carefully reviewed the manuscript and made sure that all acronyms are clearly defined the first time they are used in the text, including DLIR, MBIR, and VI in the Uncertainty Quantification section. This change improves clarity and ensures that the journal's style rules are followed. In addition, we performed a global audit of acronyms throughout the manuscript to ensure that all abbreviations are defined at first use and used consistently thereafter.

7. Please define all acronyms at first use (e.g., DLIR, MBIR, VI in UQ section).

- We thank the reviewer for this helpful comment regarding language clarity. We have performed a careful, manuscript-wide language revision to improve phrasing, readability, and grammatical accuracy. Awkward or imprecise expressions—including the phrase “uncertainty-conscience framework”—have been corrected (e.g., to “uncertainty-aware framework”), and similar instances were revised throughout the text. This final language polish improves clarity while preserving the scientific meaning and technical accuracy of the manuscript. The revised manuscript was carefully reviewed for clarity and consistency by the authors to ensure a polished academic style suitable for publication.

8. Some sentences are awkwardly phrased (e.g., “uncertainty-conscience framework” should likely be “uncertainty-aware framework”). Consider a final language polish. (Comments on the Quality of English Language)

Reviewer 4 Report

Comments and Suggestions for Authors

This manuscript deals with a timely topic; however, only generic information was provided about the methodology used to analyze CT images, without the algorithmic and IT details (e.g., links to program code or existing software) that would allow other researchers to replicate the study findings.

Furthermore, from my understanding it appears that no direct comparison was made against a control group of other models applied to the same CT datasets using inferential statistical methods. Instead, the performance of the proposed framework was measured in absolute terms and compared descriptively with literature findings, preventing a rigorous and scientifically meaningful evaluation of its actual advantages over existing technologies.

Minor comments:

1) The manuscript is structured more like a dissertation than an original research paper. I suggest that, for better clarity and readability, it be converted into a structured format (such as Background, Materials and Methods, Results, and Conclusions), starting from the Abstract. The 'Related work' section should be removed and its main findings condensed in the revised Discussion section, which should focus specifically on interpreting and critically comparing the study findings with those from the literature.

2) The introduction is very long and should be shortened by about 50%, reducing the (correct) emphasis on radiation dose reduction and relevance of CT imaging in the workup of renal masses, while further stressing the importance of measurement accuracy (also in terms of inter-reader variability between different radiologists, CT protocols and CT scanners) and of explainability of AI models.

3) Line 257. Please remove the words: 'when analytical rigor is applied'.

4) Results, lines 429-431. How were the 40 patients of the final group that fulfilled the inclusion criteria selected? Consecutively or randomly? Who performed the patient selection?

5) Results, lines 433-434. Please report the degree of experience (in terms of years after board certification for radiologists and years of residency course for residents) of the six radiologist observers.

Author Response

Reviewer 4

Thank you very much for the time you spent reviewing our paper and for your comments, which helped make it more impactful. We have carefully taken your comments into consideration in revising the paper. Below, we list your comments and describe how we have addressed them.

- We sincerely thank the reviewer for this detailed and thoughtful assessment and for acknowledging the timeliness of the topic.

This manuscript deals with a timely topic; however, only generic information was provided about the methodology used to analyze CT images, without the algorithmic and IT details (e.g., links to program code or existing software) that would allow other researchers to replicate the study findings. Furthermore, from my understanding it appears that no direct comparison was made against a control group of other models applied to the same CT datasets using inferential statistical methods. Instead, the performance of the proposed framework was measured in absolute terms and compared descriptively with literature findings, preventing a rigorous and scientifically meaningful evaluation of its actual advantages over existing technologies.

- We agree with the reviewer's assessment and have performed a comprehensive restructuring of the manuscript. The following changes have been implemented:

1. Structured Abstract: The abstract has been reformatted into a structured layout with specific headers (Background, Methods, Results, and Conclusions), providing a clear summary of the study's core components.
2. Standardized Research Format: The paper now follows the conventional IMRAD (Introduction, Materials and Methods, Results, and Discussion) structure.
   - The Introduction now ends with a clear statement of the study's goals and hypotheses.
   - Materials and Methods (Section 2) now includes the dataset description, the technical framework, and the mathematical formulations, ensuring all "how-to" information is consolidated.
   - Results (Section 3) is now strictly limited to data findings, including all descriptive statistics, agreement metrics, machine learning performance, and uncertainty validation tables/figures.
3. Removal of 'Related Work' Section: As suggested, the standalone "Related Work" section has been removed. The literature review and findings from previous studies (e.g., work by Zhang et al., Morimoto et al., and Chen et al.) have been moved to the Discussion (Section 4).
4. Enhanced Discussion: The Discussion section has been rewritten to focus on a critical comparison between our findings and the existing literature. We have used our results (such as the CCC of 0.9930 and the superiority of Linear Regression) to compare directly against benchmarks from the integrated related work, fulfilling the requirement for a critical interpretation of the study's contributions.
5. Preservation of Technical Detail: While the structure has been condensed for better flow, we have ensured that all essential equations, figures, and tables from the original document were preserved in their appropriate new sections.

1) The manuscript is structured more like a dissertation than an original research paper. I suggest that, for better clarity and readability, it be converted into a structured format (such as Background, Materials and Methods, Results, and Conclusions), starting from the Abstract. The 'Related work' section should be removed and its main findings condensed in the revised Discussion section, which should focus specifically on interpreting and critically comparing the study findings with those from the literature.

- We thank the reviewer for this constructive suggestion. We have significantly revised the Introduction to improve conciseness and focus:

1. Length Reduction: The overall length of the Introduction has been reduced by approximately 50%, removing general background information regarding the relevance of CT and broad radiation risks.
2. Increased Focus on Accuracy and Variability: We have refocused the text to address the technical challenges of measurement accuracy more directly. The revised text now specifically highlights the issues of inter-reader variability, inconsistent scanner technologies, and the impact of varied CT protocols on longitudinal surveillance.
3. Stress on Explainability: We have further emphasized the clinical requirement for explainable and interpretable AI models, arguing that transparency is essential for medical decision-making in active surveillance workflows.
4. Preservation of Key Framework Goals: While shortening the section, we carefully preserved the core "gap analysis" and "hypotheses" (previously highlighted in yellow), as requested by the overall review process, to ensure the methodology remains clearly motivated.

2) The introduction is very long and should be shortened by about 50%, reducing the (correct) emphasis on radiation dose reduction and relevance of CT imaging in the workup of renal masses, while further stressing the importance of measurement accuracy (also in terms of inter-reader variability between different radiologists, CT protocols and CT scanners) and of explainability of AI models.

- We thank the reviewer for this suggestion. The requested phrase has been removed from the manuscript to improve clarity and tone.

3) Line 257. Please remove the words: 'when analytical rigor is applied'.

- We thank the reviewer for this important request for clarification. We have updated the Materials and Methods section (specifically Section 2.1: Dataset Description) to include these details:

1. Selection Method: The 40 patients were selected consecutively from the larger KiTS19 cohort based on their fulfillment of the predefined inclusion criteria (tumor size ≤4 cm, adequate late-arterial enhancement, and non-infiltrative morphology).
2. Responsibility: The selection process was performed by the study investigators during the retrospective data preparation phase.

The revised text in Section 2.1 now reads: "A final group of 40 patients was selected consecutively by the study investigators from the KiTS19 cohort based on tumor size (≤4 cm for most cases), adequate late-arterial enhancement, and non-infiltrative tumor morphology..." We have highlighted this addition in the revised manuscript.

4) Results, lines 429-431. How were the 40 patients of the final group that fulfilled the inclusion criteria selected? Consecutively or randomly? Who performed the patient selection?

- We thank the reviewer for this request to further characterize the expertise of the observers involved in the dataset. We have updated the Materials and Methods section (Section 2.1: Dataset Description) to include these specific details. Based on the documentation provided in the original validation study of this publicly available dataset (Borgbjerg et al., 2025 [32]), the experience levels were as follows:

1. Radiologists: Four board-certified abdominal radiologists with 5 to 10 years of experience following their board certification.
2. Residents: Two radiology residents in their third and fourth years of residency training (senior residents).

The manuscript now explicitly states: "As documented in the original validation study of this dataset [13], the observer group consisted of four board-certified abdominal radiologists with 5 to 10 years of experience after board certification and two radiology residents in their third and fourth years of residency training." We believe this clarification confirms the high quality of the observer-provided "ground truth" measurements used in our framework.

5) Results, lines 433-434. Please report the degree of experience (in terms of years after board certification for radiologists and years of residency course for residents) of the six radiologist observers.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The revised manuscript demonstrates substantial improvement.

Author Response

We greatly appreciate the reviewer’s encouraging comments.

Reviewer 3 Report

Comments and Suggestions for Authors

Dear authors, the manuscript has been improved after revision. However, please note that some figure captions are long and could be slightly shortened for brevity. In addition, a few repetitive phrases in the discussion could be streamlined.

Author Response

We sincerely thank the reviewer for the positive assessment and for acknowledging the improvements made in the revised manuscript.

We appreciate the valuable suggestions regarding the figure captions and the repetitive phrases in the discussion section. In response, we have carefully reviewed and shortened the figure captions to enhance brevity and clarity. Additionally, we have revised the discussion section to eliminate repetitive expressions and streamline the text for improved readability and coherence.

We believe these modifications have further strengthened the overall quality of the manuscript.

Thank you again for your constructive feedback.

Reviewer 4 Report

Comments and Suggestions for Authors

Thank you for your reply. The revised manuscript is much better than the original version.

Author Response

We sincerely thank the reviewer for the encouraging and positive feedback. We are pleased to know that the revised manuscript is much improved compared to the original version. Your constructive comments have been invaluable in helping us enhance the clarity and overall quality of the work.
