Review Reports
- Lorenzo Spagnoli,
- Silvia Strolin and
- Miriam Santoro
- et al.
Reviewer 1: Dan-Alexandru Szabo
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- A retrospective dataset of 839 VMAT plans was used to extract 4,470 dosiomic features from five automatic isodose contours. The recommendation is to specify precisely what VMAT stands for: volumetric modulated arc therapy (VMAT).
- It is recommended that the introduction part should be much better grounded from a theoretical point of view.
- At the end of the introduction, authors must outline the purpose of the paper and specify precisely what novel elements this study presents, as well as its place within the specialised scientific literature.
- After these changes have been made, it is recommended that authors specify the structure of the paper/how it is divided in the last sentence of the introduction.
- The training process has been very time-consuming, mostly due to the lack of ad-hoc computational resources. Both hybrid models required from 30 minutes to 4 hours for each training epoch and reached convergence to a suitable performance on the validation dataset after about 150 epochs. For this reason, a more in-depth search of optimal architecture and hyper-parameters was not possible and will be the object of future works. It is recommended that this sentence be rephrased and included in Chapter Limitations of the study and future research directions.
- Discussions should begin by specifying the paper's purpose and indicating whether it has been accomplished or not.
- To the authors' knowledge, this is the first study that created a model based on the 3D absorbed dose distribution and evaluated its performance in differentiating between plans with and without a Flattening Filter. Please include this sentence in the introduction as one of the novel elements of the study.
- Given the journal's prestige, please remove or replace the following outdated bibliographical sources from the study: 3.
Moderate revisions to the English language are required.
Author Response
Comments and Suggestions for Authors
A retrospective dataset of 839 VMAT plans was used to extract 4,470 dosiomic features from five automatic isodose contours. The recommendation is to specify precisely what VMAT stands for: volumetric modulated arc therapy (VMAT).
R: We have specified the meaning of the acronym VMAT as suggested.
It is recommended that the introduction part should be much better grounded from a theoretical point of view.
At the end of the introduction, authors must outline the purpose of the paper and specify precisely what novel elements this study presents, as well as its place within the specialised scientific literature.
R: The introduction has been improved as suggested.
After these changes have been made, it is recommended that authors specify the structure of the paper/how it is divided in the last sentence of the introduction.
The training process has been very time-consuming, mostly due to the lack of ad-hoc computational resources. Both hybrid models required from 30 minutes to 4 hours for each training epoch and reached convergence to a suitable performance on the validation dataset after about 150 epochs. For this reason, a more in-depth search of optimal architecture and hyper-parameters was not possible and will be the object of future works. It is recommended that this sentence be rephrased and included in Chapter Limitations of the study and future research directions.
Discussions should begin by specifying the paper's purpose and indicating whether it has been accomplished or not.
R: The discussion has been modified according to the reviewer’s suggestions (lines 428-430)
To the authors' knowledge, this is the first study that created a model based on the 3D absorbed dose distribution and evaluated its performance in differentiating between plans with and without a Flattening Filter (lines 78-82). Please include this sentence in the introduction as one of the novel elements of the study.
R: We have included the sentence in the introduction as suggested by the reviewer (lines 78-80)
Given the journal's prestige, please remove or replace the following outdated bibliographical sources from the study: 3.
Comments on the Quality of English Language
Moderate revisions to the English language are required.
R: The text has been improved as suggested.
Reviewer 2 Report
Comments and Suggestions for Authors
This study employs deep learning to analyze three-dimensional spatial dose distributions and integrates engineered dosimetric features to develop a hybrid model that predicts whether radiotherapy plans require patient-specific quality assurance (PSQA). Trained on clinical data and prospectively validated, the approach enables more precise triage of cases that truly need verification, thereby conserving linear accelerator time and staff effort while improving workflow safety and efficiency. Overall, the research is practically oriented. To further strengthen the manuscript and advance the work, the following recommendations are offered:
- The manuscript presents confusion matrices and workload estimates for FF and FFF scenarios (“QA needed” vs. “not needed”). Please incorporate a unified cost–benefit and decision-curve analysis that balances gains from correctly identifying truly necessary QA against the costs of additional unnecessary QA. Provide subgroup analyses by disease site to help medical physics teams select an optimal operating point under different thresholds.
- The model is trained using γ passing rate ≤97% (near the cohort median) to detect plans deviating from typical performance, whereas clinical decision-making often follows TG-218 with γ3%/2 mm >95% as the action threshold. Consider elevating the 95% threshold to a co-primary analysis and complement it with decision curves/net-benefit assessment as well as equivalence/non-inferiority perspectives to clarify clinical relevance.
- Dose volumes are cropped/resampled to a 27-cm cubic field with ~3-mm voxels. While computationally convenient, this preprocessing constrains the assessable volume and voxel heterogeneity, risking failures in large-volume or multi-lesion cases with wide spatial separation. Please quantify the coverage limitations imposed by preprocessing and discuss performance risks in such edge cases.
- The current 80/20 split with per-epoch random re-partitioning (re-sampling train/test each epoch) may inflate generalization if multiple plans from the same patient—or plans with near-identical parameters—appear across splits. Adopt fixed, patient- (or course-) level partitions with a strict hold-out set, and add a year-wise hold-out to test robustness under machine drift and workflow changes.
- The authors note that each epoch takes 30 minutes to 4 hours, with ~150 epochs to converge, limiting architectural and hyperparameter exploration. We recommend a compact Bayesian optimization or hyperparameter search using surrogate tasks or dimension-reduced inputs, and reporting its impact on clinically meaningful endpoints.
Author Response
This study employs deep learning to analyze three-dimensional spatial dose distributions and integrates engineered dosimetric features to develop a hybrid model that predicts whether radiotherapy plans require patient-specific quality assurance (PSQA). Trained on clinical data and prospectively validated, the approach enables more precise triage of cases that truly need verification, thereby conserving linear accelerator time and staff effort while improving workflow safety and efficiency. Overall, the research is practically oriented. To further strengthen the manuscript and advance the work, the following recommendations are offered:
- The manuscript presents confusion matrices and workload estimates for FF and FFF scenarios (“QA needed” vs. “not needed”). Please incorporate a unified cost–benefit and decision-curve analysis that balances gains from correctly identifying truly necessary QA against the costs of additional unnecessary QA. Provide subgroup analyses by disease site to help medical physics teams select an optimal operating point under different thresholds.
R: We thank the Reviewer for the relevant and interesting suggestion. We have implemented a decision curve analysis for both models, both separately for FFF and FF plans and irrespective of filter use. The figures below show the results of this analysis.
Regarding the subdivision by disease type, we have some follow-up considerations.
Firstly, we think the most effective path to implementing such a division would be to re-train the models with an additional input feature encoding disease type; without this inclusion in training, the evaluation of model performance would be unfair.
In fact, the addition of such a feature is considered one of the first future improvements of the model. However, the reduced time frame allowed in the review process and the lack of ad-hoc computational resources make it impossible to perform this analysis at this stage of the review.
2. The model is trained using γ passing rate ≤97% (near the cohort median) to detect plans deviating from typical performance, whereas clinical decision-making often follows TG-218 with γ3%/2 mm >95% as the action threshold. Consider elevating the 95% threshold to a co-primary analysis and complement it with decision curves/net-benefit assessment as well as equivalence/non-inferiority perspectives to clarify clinical relevance.
R: The decision curve analysis has been performed using both 97% and 95% as decision thresholds. Using the 95% threshold, the physics team considers a reasonable threshold-probability range to be around 10-20% (indicating that performing 5-10 QAs to find 1 failing plan would be acceptable). Considering this choice and evaluating the problem from a net-reduction-in-interventions standpoint, the hybrid-2 model is a slight improvement over hybrid-1, both being superior to the "test none" approach and leading to a net reduction in performed QA procedures of 25% to 40%.
When considering only plans that use the flattening filter, the reduction in unnecessary procedures ranges only from 20% to 30%, making the "test none" strategy advantageous for threshold probabilities over 15%. Considering only the FFF plans, the net reduction in interventions is much more evident, ranging from 30% to 45%. Figures 8 and 9 have been added.
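For readers unfamiliar with decision curve analysis, the net-benefit quantity it plots can be sketched as follows. This is a generic illustration of the standard net-benefit formula, not the authors' actual implementation; the function names and example arrays are hypothetical, with "treatment" corresponding to performing a QA measurement:

```python
def net_benefit(y_true, y_prob, pt):
    """Net benefit of performing QA on plans whose predicted failure
    probability is at least the threshold probability pt.
    Rewards true positives (needed QA performed) and penalizes false
    positives (unnecessary QA) weighted by the odds pt / (1 - pt)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= pt and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= pt and t == 0)
    return tp / n - (fp / n) * pt / (1 - pt)

def net_benefit_test_all(y_true, pt):
    """Baseline "test all" strategy: every plan is sent to QA."""
    prev = sum(y_true) / len(y_true)
    return prev - (1 - prev) * pt / (1 - pt)
```

Plotting these two curves (plus the trivial "test none" line at zero) over a grid of threshold probabilities, as in the added figures, shows over which range the model yields a net benefit compared with measuring every plan or none.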
3. Dose volumes are cropped/resampled to a 27-cm cubic field with ~3-mm voxels. While computationally convenient, this preprocessing constrains the assessable volume and voxel heterogeneity, risking failures in large-volume or multi-lesion cases with wide spatial separation. Please quantify the coverage limitations imposed by preprocessing and discuss performance risks in such edge cases.
R: We acknowledge the Reviewer's concern, which we had also identified as a potential limitation during the original analysis. In our institution, the use of ~3 mm voxels represents the default clinical standard and is therefore the most widely adopted resolution for both planning and evaluation purposes. It provides a practical balance between spatial accuracy and computational efficiency in the majority of our radiotherapy treatments.
We recognize, however, that in specific scenarios, such as stereotactic treatments or cases involving very small or spatially distant lesions, a smaller voxel size or a larger field of view may be warranted to ensure adequate coverage and preserve spatial heterogeneity. In such instances, preprocessing parameters can be adjusted accordingly to prevent underrepresentation of peripheral or multiple targets. A quantitative evaluation of the potential coverage limitations is being considered for future work, aiming to characterize the impact of resampling on dose-volume accuracy and model performance in these edge cases.
4. The current 80/20 split with per-epoch random re-partitioning (re-sampling train/test each epoch) may inflate generalization if multiple plans from the same patient—or plans with near-identical parameters—appear across splits. Adopt fixed, patient- (or course-) level partitions with a strict hold-out set, and add a year-wise hold-out to test robustness under machine drift and workflow changes.
R: We thank the Reviewer for these valuable suggestions. We fully acknowledge that per-epoch random re-partitioning may introduce bias, particularly in cases where multiple plans from the same patient or with similar parameters are included across training and test sets. Implementing a fixed, patient- or course-level partitioning strategy with a strict hold-out set, as well as a year-wise validation to assess robustness under potential machine or workflow drift, would indeed provide a more rigorous evaluation of generalization.
At this stage, re-training the neural network from scratch with these updated data splits would require computational resources currently allocated to other projects and is therefore not immediately feasible. Nonetheless, we consider this an important methodological improvement and plan to incorporate it in the next iteration of the study as part of our ongoing development efforts.
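The grouped partitioning the Reviewer recommends can be sketched in a few lines. This is a minimal, generic illustration (not the authors' pipeline); the function name and the patient identifiers are hypothetical, and in practice one would use an established utility such as scikit-learn's GroupShuffleSplit:

```python
import random

def patient_level_split(patient_ids, test_frac=0.2, seed=42):
    """Assign whole patients (not individual plans) to train or test,
    so that no patient's plans appear in both partitions.
    Returns (train_indices, test_indices) into the plan list."""
    rng = random.Random(seed)                # fixed seed -> fixed split
    patients = sorted(set(patient_ids))
    rng.shuffle(patients)
    n_test = max(1, round(test_frac * len(patients)))
    test_patients = set(patients[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx
```

Because the random choice is made over patients rather than plans, plans from the same patient (or the same treatment course, if course IDs are used as the grouping key) can never leak across the train/test boundary; a year-wise hold-out follows the same pattern with the treatment year as the grouping key.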
5. The authors note that each epoch takes 30 minutes to 4 hours, with ~150 epochs to converge, limiting architectural and hyperparameter exploration. We recommend a compact Bayesian optimization or hyperparameter search using surrogate tasks or dimension-reduced inputs, and reporting its impact on clinically meaningful endpoints.
R: We thank the Reviewer for these valuable suggestions. At this stage, even a reduced hyperparameter search followed by full retraining of the neural network would require computational resources that are currently not available. Therefore, this additional analysis cannot be performed within the time frame allowed for the revision. Nevertheless, we fully acknowledge the relevance of this point, as reported in the discussion, and it will be carefully considered for future developments and extensions of the project.
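To illustrate the surrogate-task idea behind the Reviewer's recommendation, a compact search skeleton could look like the following. This is a plain random search, simpler than the Bayesian optimization suggested, and everything in it is hypothetical: the hyper-parameter names, values, and the placeholder objective do not come from the manuscript:

```python
import random

# Hypothetical search space; names and values are placeholders, not the
# authors' actual hyper-parameters.
SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "n_filters": [16, 32, 64],
    "dropout": [0.0, 0.2, 0.5],
}

def train_and_score(config):
    """Placeholder surrogate objective. In practice this would train a
    down-sized model on dimension-reduced inputs for a few epochs and
    return a cheap proxy for validation performance (e.g. AUC)."""
    return -abs(config["dropout"] - 0.2) - abs(config["n_filters"] - 32) / 100

def random_search(space, n_trials=20, seed=0):
    """Sample n_trials configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        score = train_and_score(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

best_cfg, best_score = random_search(SPACE)
```

The winning configuration from the cheap surrogate would then be retrained once at full scale, which keeps the expensive 30-minute-to-4-hour epochs out of the search loop; a Bayesian optimizer would replace the uniform sampling with a model-guided proposal but keep the same outer structure.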
Reviewer 3 Report
Comments and Suggestions for Authors
This study proposes a deep learning-based optimization method for patient-specific quality assurance (PSQA) processes in radiotherapy. By combining dosiomic features of the dose distribution and a 3D convolutional neural network, the aim is to improve both the efficiency and sensitivity of the PSQA process.
1. Although the article proposes a "hybrid" model (dosiomic features + 3D CNN), it lacks an in-depth mechanistic analysis of the complementarity between the two types of features, their respective contributions, and the fusion mechanism. It is recommended to supplement the discussion on the physical and biological significance of dosiomic features and 3D dose distributions in PSQA discrimination, clarifying why their combination can enhance model performance.
2. The article only compares two custom hybrid structures, lacking a systematic comparison with mainstream 2D/3D CNNs and traditional machine learning methods (such as random forests, SVM, etc.). It is suggested to include comparative experiments with baseline models to clearly demonstrate the advantages and limitations of the proposed method.
Author Response
This study proposes a deep learning-based optimization method for patient-specific quality assurance (PSQA) processes in radiotherapy. By combining dosiomic features of the dose distribution and a 3D convolutional neural network, the aim is to improve both the efficiency and sensitivity of the PSQA process.
- Although the article proposes a "hybrid" model (dosiomic features + 3D CNN), it lacks an in-depth mechanistic analysis of the complementarity between the two types of features, their respective contributions, and the fusion mechanism. It is recommended to supplement the discussion on the physical and biological significance of dosiomic features and 3D dose distributions in PSQA discrimination, clarifying why their combination can enhance model performance.
R: We appreciate the suggestion. Since another Reviewer asked for a more complete introduction, we have added these considerations there (lines 60 to 73). We have included a brief explanation of dosiomic features, redirecting to the seminal paper for more in-depth details, as well as the reasoning that led us to use a hybrid approach in designing the networks.
- The article only compares two custom hybrid structures, lacking a systematic comparison with mainstream 2D/3D CNNs and traditional machine learning methods (such as random forests, SVM, etc.). It is suggested to include comparative experiments with baseline models to clearly demonstrate the advantages and limitations of the proposed method.
R: We thank the Reviewer for the suggestion. We have added to the supplementary material the results for a conventional CNN applied to the 3D dose distribution and a fully connected neural network using only dosiomic features as input. Figures S3 and S4 show that each structure alone performs worse than the two hybrid structures, with differences in true positive rates ranging from 8% to 20%.
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
It can be accepted now.